Notes on a New Philosophy of Empirical Science

04/28/2011 ∙ by Daniel Burfoot, et al.

This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a standard benchmark database to a small size, taking into account the length of the compressor itself. This methodology therefore includes an Occam principle as well as a solution to the problem of demarcation. Because of the fundamental difficulty of lossless compression, this type of research must be empirical in nature: compression can only be achieved by discovering and characterizing empirical regularities in the data. Because of this, the philosophy provides a way to reformulate fields such as computer vision and computational linguistics as empirical sciences: the former by attempting to compress databases of natural images, the latter by attempting to compress large text databases. The book argues that the rigor and objectivity of the compression principle should set the stage for systematic progress in these fields. The argument is especially strong in the context of computer vision, which is plagued by chronic problems of evaluation. The book also considers the field of machine learning. Here the traditional approach requires that the models proposed to solve learning problems be extremely simple, in order to avoid overfitting. However, the world may contain intrinsically complex phenomena, which would require complex models to understand. The compression philosophy can justify complex models because of the large quantity of data being modeled (if the target database is 100 GB, it is easy to justify a 10 MB model). The complex models and abstractions learned on the basis of the raw data (images, language, etc.) can then be reused to solve any specific learning problem, such as face recognition or machine translation.
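The scoring rule described above (the length of the compressor plus the length of the compressed benchmark) can be sketched in a few lines. This is a minimal sketch: zlib and the toy corpus stand in for a learned model and a real benchmark database, which are not specified here.

```python
import zlib

def compression_score(model_bytes: bytes, data: bytes) -> int:
    """Score a theory by total codelength: the size of the compressor
    (the model itself) plus the size of the benchmark data after the
    model compresses it.  zlib stands in for a hypothetical learned
    compressor built from the theory under test."""
    return len(model_bytes) + len(zlib.compress(data, 9))

# A theory is valuable if it shrinks the benchmark after paying for
# its own description length.
corpus = b"the cat sat on the mat. " * 1000   # toy stand-in corpus
assert compression_score(b"", corpus) < len(corpus)
```

A larger model is justified only if the bytes it spends on itself are recovered, with interest, in the compressed size of the data.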




1.1 Philosophical Foundations of Empirical Science

In a remarkable paper published in 1964, a biophysicist named John Platt pointed out the somewhat impolitic fact that some scientific fields made progress much more rapidly than others [89]. Platt cited particle physics and molecular biology as exemplar fields in which progress was especially rapid. To illustrate this speed he relates the following anecdote:

[Particle physicists asked the question]: Do the fundamental particles conserve mirror-symmetry or “parity” in certain reactions, or do they not? The crucial experiments were suggested: within a few months they were done, and conservation of parity was found to be excluded. Richard Garwin, Leon Lederman, and Marcel Weinrich did one of the crucial experiments. It was thought of one evening at suppertime: by midnight they had arranged the apparatus for it; and by 4 am they had picked up the predicted pulses showing the non-conservation of parity.

Platt attributed this rapid progress not to the superior intelligence of particle physicists and molecular biologists, but to the fact that they used a more rigorous scientific methodology, which he called Strong Inference. In Platt’s view, the key requirement of rapid science is the ability to rapidly generate new theories, test them, and discard those that prove to be incompatible with evidence.

Many observers of fields such as artificial intelligence (AI), computer vision, computational linguistics, and machine learning will agree that, in spite of the journalistic hype surrounding them, these fields do not make rapid progress. Research in artificial intelligence began over 50 years ago. In spite of the bold pronouncements made at the time, the field has failed to transform society. Robots do not walk the streets; intelligent systems are generally brittle and function only within narrow domains. This lack of progress is illustrated by a comment by Marvin Minsky, one of the founders of AI, in reference to David Marr, one of the founders of computer vision:

After [David Marr] joined us, our team became the most famous vision group in the world, but the one with the fewest results. His idea was a disaster. The edge finders they have now using his theories, as far as I can see, are slightly worse than the ones we had just before taking him on. We’ve lost twenty years ([26], pg. 189).

This book argues that the lack of progress in artificial intelligence and related fields is caused by philosophical limitations, not by technical ones. Researchers in these fields have no scientific methodology of power comparable to Platt’s concept of Strong Inference. They do not rapidly generate, test, and discard theories in the way that particle physicists do. This kind of critique has been uttered before, and would hardly justify a book-length exposition. Rather, the purpose of this book is to propose a scientific method that can be used, at least, for computer vision and computational linguistics, and probably for several other fields as well.

To set the stage for the proposal it is necessary to briefly examine the unique intellectual content of the scientific method. This uniqueness can be highlighted by comparing it to a theory of physics such as quantum mechanics. While quantum mechanics often seems mysterious and perplexing to beginning students, the scientific method appears obvious and inevitable. Physicists are constantly testing, examining, and searching for failures of quantum mechanics. The scientific method itself receives no comparable interrogation. Physicists are quite confident that quantum mechanics is wrong in some subtle way: one of their great goals is to find a unified theory that reconciles the conflicting predictions made by quantum mechanics and general relativity. In contrast, it is not even clear what it would mean for the scientific method to be wrong.

But consider the following chain of causation: the scientific method allowed humans to discover physics, physics allowed humans to develop technology, and technology allowed humans to reshape the world. The fact that the scientific method succeeds must reveal some abstract truth about the nature of reality. Put another way, the scientific method depends implicitly on some assertions or propositions, and because those assertions happen to be true, the method works. But what is the content of those assertions? Can they be examined, modified, or generalized?

This chapter begins with an attempt to analyze and document the assertions and philosophical commitments upon which the scientific method depends. Then, a series of thought experiments illustrate how a slight change to one of the statements results in a modified version of the method. This new version is based on large scale lossless data compression, and it uses large databases instead of experimental observation as the necessary empirical ingredient. The remainder of the chapter argues that the new method retains all the crucial characteristics of the original. The significance of the new method is that it allows researchers to conduct investigations into aspects of empirical reality that have never before been systematically interrogated. For example, Chapter 3 shows that attempting to compress a database of natural images results in a field very similar to computer vision. Similarly, attempting to compress large text databases results in a field very similar to computational linguistics. The starting point in the development is a consideration of one of the most critical components of science: objectivity.

1.1.1 Objectivity, Irrationality, and Progress

The history of humanity clearly indicates that humans are prone to dangerous flights of irrationality. Psychologists have shown that humans suffer from a wide range of cognitive blind spots, with names like Scope Insensitivity and Availability Bias [56]. One special aspect of human irrationality of particular relevance to science is the human propensity to enshrine theories, abstractions, and explanations without sufficient evidence. Often, once a person decides that a certain theory is true, he begins to use that theory to interpret all new evidence. This distorting effect prevents him from seeing the flaws in the theory itself. Thus Ptolemy believed that the Sun rotated around the Earth, while Aristotle believed that all matter could be decomposed into the elements of fire, air, water, or earth.

Individual human fallibility is not the only obstacle to intellectual progress; another powerful barrier is group irrationality. Humans are fundamentally social creatures; no individual acting alone could ever obtain substantial knowledge about the world. Instead, humans must rely on a division of labor in which knowledge-acquisition tasks are delegated to groups of dedicated specialists. This division of labor is replicated even within the scientific community: physicists rely extensively on the experimental and theoretical work of other physicists. But groups are vulnerable to an additional set of perception-distorting effects involving issues such as status, signalling, politics, conformity pressure, and pluralistic ignorance. A low-ranking individual in a large group cannot comfortably disagree with the statements of a high-ranking individual, even if the former has truth on his side. Furthermore, scientists are naturally competitive and skeptical. A scientist proposing a new result must be prepared to defend it against inevitable criticism.

To overcome the problems of individual irrationality and group irrationality, a single principle is tremendously important: the principle of objectivity. Objectivity requires that new results be validated by a mechanistic procedure that cannot be influenced by individual perceptions or sociopolitical effects. While humans may implement the validation procedure, it must be somehow independent of the particular oddities of the human mind. In the language of computer science, the procedure must be like an abstract algorithm that does not depend on the particular architecture of the machine it is running on. The validation procedure helps to prevent individual irrationality by requiring scientists to hammer their ideas against a hard anvil. It also protects against group irrationality by providing scientists with a strong shield against criticism and pressure from the group.

The objectivity principle is also an important requirement for a field to make progress. Researchers in all fields love to publish papers. If a field lacks an objective validation procedure, it is difficult to prevent people from publishing low quality papers that contain incorrect results or meaningless observations. The so-called hard sciences such as mathematics, physics, and engineering employ highly objective evaluation procedures, which facilitates rapid progress. Fields such as psychology, economics, and medical science rely on statistical methods to validate their results. These methods are less rigorous, and this leads to significant problems in these fields, as illustrated by a recent paper entitled “Why most published research findings are false” [52]. Nonscientific fields such as literature and history rely on the qualitative judgments of practitioners for the purposes of evaluation. These examples illustrate a striking correlation between the objectivity of a field’s evaluation methods and the degree of progress it achieves.

1.1.2 Validation Methods and Taxonomy of Scientific Activity

The idea of objectivity, and the mechanism by which various fields achieve objectivity, can be used to define a useful taxonomy of scientific fields. Scientific activity, broadly considered, can be categorized into three parts: mathematics, empirical science, and engineering. These activities intersect at many levels, and often a single individual will make contributions in more than one area. But the three categories produce very distinct kinds of results, and utilize different mechanisms to validate the results and thereby achieve objectivity.

Mathematicians see the goal of their efforts as the discovery of new theorems. A theorem is fundamentally a statement of implication: if a certain set of assumptions are true, then some derived conclusion must hold. The legitimate mechanism for demonstrating the validity of a new result is a proof. Proofs can be examined by other mathematicians and verified to be correct, and this process provides the field with its objective validation mechanism. It is worthwhile to note that practical utility plays no essential role in the validation process. Mathematicians may hope that their results are useful to others, but this is not a requirement for a theorem to be considered correct.

Engineers, in contrast, take as their basic goal the development of practical devices. A device is a set of interoperating components that produce some useful effect. The word “useful” may be broadly interpreted: sometimes the utility of a new device may be speculative, or it may be useful only as a subcomponent of a larger device. Either way, the requirement for proclaiming success in engineering is a demonstration that the device works. It is very difficult to game this process: if the new airplane fails to take off or the new microprocessor fails to multiply numbers correctly, it is obvious that these results are low-quality. Thus, this public demonstration process provides engineering with its method of objective validation.

The third category, and the focus of this book, is empirical science. Empirical scientists attempt to obtain theories of natural phenomena. A theory is a tool that enables the scientist to make predictions regarding a phenomenon. The value and quality of a theory depends entirely on how well it can predict the phenomenon to which it applies. Empirical scientists are similar to mathematicians in the purist attitude they take toward the product of their research: they may hope that a new theory will have practical applications, but this is not a requirement.

Mathematics and engineering both have long histories. Mathematics dates back at least to 500 BC, when Pythagoras proved the theorem that bears his name. Engineering is even older; perhaps the first engineers were the men who fashioned axes and spearheads out of flint and thus ushered in the Stone Age. Systematic empirical science, in contrast, started only relatively recently, building on the work of thinkers like Galileo, Descartes, and Bacon. It is worth asking why the ancient philosophers in civilizations like Greece, Babylon, India, and China, in spite of their general intellectual advancement, did not begin a systematic empirical investigation of various natural phenomena.

The delay could have been caused by the fact that, for a long time, no one realized that there could be, or needed to be, an area of intellectual inquiry that was distinct from mathematics and engineering. Even today, it is difficult for nonscientists to appreciate the difference between a statement of mathematics and a statement of empirical science. After all, physical laws are almost always expressed in mathematical terms. What is the difference between the Newtonian statement F = ma and the Pythagorean theorem a^2 + b^2 = c^2? These statements, though they are expressed in a similar form, are in fact completely different constructs: one is an empirical theory, the other is a mathematical law. Several heuristics can be used to differentiate between the two types of statement. One good technique is to ask if the statement could be invalidated by some new observation or evidence. One could draw a misshapen triangle that did not obey the Pythagorean theorem, but that would hardly mean anything about the truth of the theorem. In contrast, there are observations that could invalidate Newton’s laws, and in fact such observations were made as a result of Einstein’s theory of relativity. There are, in turn, also observations that could disprove relativity.

Ancient thinkers might also have failed to see how it could be meaningful to make statements about the world that were not essentially connected to the development of practical devices. An ancient might very well have believed that it was impossible or meaningless to find a unique optimal theory of gravity and mass. Instead, scientists should develop a toolbox of methods for treating these phenomena. Engineers should then select a tool that is well-suited to the task at hand. So an engineer might very well utilize one theory of gravity to design a bridge, and then use some other theory when designing a catapult. In this mindset, theories can only be evaluated by incorporating them into some practical device and then testing the device.

Empirical science is also unique in that it depends on a special process for obtaining new results. This process is called the scientific method; there is no analogous procedure in mathematics or engineering. Without the scientific method, empirical scientists cannot do much more than make catalogs of disconnected and uninterpretable observations. When equipped with the method, scientists begin to discern the structure and meaning of the observational data. But as explained in the next section, the scientific method is only obvious in hindsight. It is built upon deep philosophical commitments that would have seemed bizarre to an ancient thinker.

1.1.3 Toward a Scientific Method

To understand the philosophical commitments implicit in empirical science, and to see why those commitments were nonobvious to the ancients, it is helpful to look at some other plausible scientific procedures. To do so, it is convenient to introduce the following simplified abstract description of the goal of scientific reasoning. Let x be an experimental configuration, and y be the experimental outcome. The variables x and y should be thought of not as numbers but as large packets of information including descriptions of various objects and quantities. The goal of science is to find a function f that predicts the outcome of the configuration: y = f(x).

A first approach to this problem, which can be called the pure theoretical approach, is to deduce the form of f using logic alone. In this view, scientists should use the same mechanism for proving their statements that mathematicians use. Here there is no need to check the results of a prediction against the experimental outcome. Just as it is meaningless to check the Pythagorean theorem by drawing triangles and measuring their sides, it is meaningless to check the function f against the actual outcomes y. Mathematicians can achieve perfect confidence in their theories without making any kind of appeal to experimental validation, so why shouldn’t scientists be able to reason the same way? If Euclid can prove, based on purely logical and conceptual considerations, that the sum of the angles of a triangle adds up to 180 degrees, why cannot Aristotle use analogous considerations to conclude that all matter is composed of the four classical elements? A subtle critic of this approach might point out that mathematicians require the use of axioms, from which they deduce their results, and it is not clear what statements can play this role in the investigation of real-world phenomena. But even this criticism can be answered; perhaps the existence of human reason is the only necessary axiom, or perhaps the axioms can be found in religious texts. Even if someone had proposed to check a prediction against the actual outcome, it is not at all clear what this means or how to go about doing it. What would it mean to check Aristotle’s theory of the four elements? The ancients must have viewed the crisp proof-based validation method of mathematics as far more rigorous and intellectually satisfying than the tedious, error prone, and conceptually murky process of observation and prediction-checking.

At the other extreme from the pure theoretical approach is the strategy of searching for f using a purely experimental investigation of various phenomena. The plan here would be to conduct a large number of experiments, and compile the results into an enormous almanac. Then to make a prediction in a given situation, one simply looks up a similar situation in the almanac, and uses the recorded value. For example, one might want to predict whether a bridge will collapse under a certain weight. Then one simply looks up the section marked “bridges” in the almanac, finds the bridge in the almanac that is most similar to the one in question, and notes how much weight it could bear. In other words, the researchers obtain a large number of data samples and define f as an enormous lookup table. The pure experimental approach has an obvious drawback: it is immensely labor-intensive. The researchers given the task of compiling the section on bridges must construct several different kinds of bridges, and pile them up with weight until they collapse. Bridge building is not easy work, and the almanac section on bridges is only one among many. The pure experimental approach may also be inaccurate, if the almanac includes only a few examples relating to a certain topic.
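The almanac strategy amounts to a nearest-neighbor lookup table. A minimal sketch, in which the bridge entries and the similarity measure are invented for illustration:

```python
def almanac_predict(almanac, query, similarity):
    """Pure experimental approach: no theory at all, just look up the
    most similar previously recorded configuration and reuse its
    recorded outcome."""
    best_config, best_outcome = max(
        almanac, key=lambda entry: similarity(entry[0], query))
    return best_outcome

# Hypothetical "bridges" section: (span in meters, max load in tons).
bridges = [(10, 50), (20, 35), (40, 20)]
# Predict the load for an untested 22 m bridge: nearest recorded span
# is 20 m, so the almanac answers with that bridge's load.
load = almanac_predict(bridges, 22, lambda a, b: -abs(a - b))
assert load == 35
```

The drawbacks in the text are visible here: every entry had to be built and broken to fill the table, and a sparse table gives poor answers.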

Obviously, neither the pure theoretical approach nor the pure experimental approach is very practical. The great insight of empirical science is that one can effectively combine experimental and theoretical investigation in the following way. First, a set of experiments corresponding to configurations x_1, ..., x_n are performed, leading to outcomes y_1, ..., y_n. The difference between this process and the pure experimental approach is that here the number n of tested configurations is much smaller. Then, in the theoretical phase, one attempts to find a function f that agrees with all of the data: f(x_i) = y_i for all i. If such a function is found, and it is in some sense simple, then one concludes that it will generalize and make correct predictions when applied to new configurations that have not yet been tested.
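The hybrid process can be sketched as fitting a simple function to a handful of configuration-outcome pairs and then predicting untested configurations. A least-squares line stands in for the theoretical phase here; the data and the hidden law are invented for illustration:

```python
def fit_linear_theory(xs, ys):
    """Theoretical phase: find a simple f (a line y = a*x + b) that
    agrees with the small set of tested configurations, by ordinary
    least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

# Experimental phase: only a handful of configurations are tested.
xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]       # hidden law: y = 2x + 1
f = fit_linear_theory(xs, ys)
assert all(abs(f(x) - y) < 1e-9 for x, y in zip(xs, ys))
# Generalization: predict an untested configuration.
assert abs(f(10) - 21) < 1e-9
```

The last line is the leap of faith the next paragraph worries about: nothing in the four observed pairs logically forces the prediction at x = 10.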

This description of the scientific process should produce a healthy dose of sympathy for the ancient thinkers who failed to discover it. The idea of generalization, which is totally essential to the entire process, is completely nonobvious and raises a number of nearly intractable philosophical issues. The hybrid process assumes the existence of a finite number of observations (x_i, y_i), but claims to produce a universal predictive rule f. Under what circumstances is this legitimate? Philosophers have been grappling with this question, called the Problem of Induction, since the time of David Hume. Also, a moment’s reflection indicates that the problem considered in the theoretical phase does not have a unique solution. If the observed data set is finite, then there will be a large number of functions f that agree with it. These functions must make the same predictions for the known data x_i, but may make very different predictions for other configurations.

1.1.4 Occam’s Razor

William of Occam famously articulated the principle that bears his name with the Latin phrase: entia non sunt multiplicanda praeter necessitatem; entities must not be multiplied without necessity. In plainer English, this means that if a theory is adequate to explain a body of observations, then one should not add gratuitous embellishments or clauses to it. To wield Occam’s Razor means to take a theory and cut away all of the inessential parts until only the core idea remains.

Scientists use Occam’s Razor to deal with the problem of theory degeneracy mentioned above. Given a finite set of experimental configurations x_i and corresponding observed outcomes y_i, there will always be an infinite number of functions f that agree with all the observations. The number of compatible theories is infinite because one can always produce a new theory by adding a new clause or qualification to a previous theory. For example, one theory might be expressed in English as “General relativity holds everywhere in space”. This theory agrees with all known experimental data. But one could then produce a new theory that says “General relativity holds everywhere in space except in the Alpha Centauri solar system, where Newton’s laws hold.” Since it is quite difficult to show the superiority of the theory of relativity over Newtonian mechanics even in our local solar system, it is probably almost impossible to show that relativity holds in some other, far-off star system. Furthermore, an impious philosopher could generate an effectively infinite number of variant theories of this kind, simply by replacing “Alpha Centauri” with the name of some other star. This produces a vast number of conflicting accounts of physical reality, each with about the same degree of empirical evidence.

Scientists use Occam’s Razor to deal with this kind of crisis by justifying the disqualification of the variant theories mentioned above. Each of the variants has a gratuitous subclause, that specifies a special region of space where relativity does not hold. The subclause does not improve the theory’s descriptive accuracy; the theory would still agree with all observational data if it were removed. Thus, the basic theory that relativity holds everywhere stands out as the simplest theory that agrees with all the evidence. Occam’s Razor instructs us to accept the basic theory as the current champion, and only revise it if some new contradictory evidence arrives.

This idea sounds attractive in the abstract, but raises a thorny philosophical problem when put into practice. Formally, the razor requires one to construct a functional C(f) that rates the complexity of a theory. Then given a set F of theories all of which agree with the empirical data, the champion theory f* is simply the least complex member of F:

    f* = argmin_{f in F} C(f)

The problem is: how does one obtain the complexity functional C? Given two candidate definitions for the functional, how does one decide which is superior? It may very well be that complexity is in the eye of the beholder, and that two observers can legitimately disagree about which of two theories is more complex. This disagreement would, in turn, cause them to disagree about which member of a set of candidate theories should be considered the champion on the basis of the currently available evidence. This kind of disagreement appears to undermine the objectivity of science. Fortunately, in practice, the issue is not insurmountable. Informal measures of theory complexity, such as the number of words required to describe a theory in English, seem to work well enough. Most scientists would agree that “relativity holds everywhere” is simpler than “relativity holds everywhere except around Alpha Centauri”. If a disagreement persists, then the disputants can, in most cases, settle the issue by running an actual experiment.
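The selection rule can be sketched as follows, using, as the text suggests, the length of a theory's statement as a crude stand-in for the contested complexity functional; the theories and data are toy examples:

```python
def champion(theories, data):
    """Occam's Razor as selection: among the theories that agree with
    every observed (x, y) pair, pick the least complex one.  Complexity
    is crudely scored as the length of the theory's English statement;
    this C is exactly the functional the text says is contested."""
    compatible = [t for t in theories
                  if all(t["predict"](x) == y for x, y in data)]
    return min(compatible, key=lambda t: len(t["statement"]))

# Two variants that make identical predictions on all observed data.
theories = [
    {"statement": "relativity holds everywhere",
     "predict": lambda x: 2 * x},
    {"statement": "relativity holds everywhere except around Alpha Centauri",
     "predict": lambda x: 2 * x},
]
data = [(1, 2), (3, 6)]
assert champion(theories, data)["statement"] == "relativity holds everywhere"
```

The gratuitous subclause adds description length without adding descriptive accuracy, so the basic theory wins.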

1.1.5 Problem of Demarcation and Falsifiability Principle

The great philosopher of science Karl Popper proposed a principle called falsifiability that substantially clarified the meaning and justification of scientific theorizing [92]. Popper was motivated by a desire to rid the world of pseudosciences such as astrology and alchemy. The problem with this goal is that astrologers and alchemists may very well appear to be doing real science, especially to laypeople. Astrologers may employ mathematics, and alchemists may utilize much of the same equipment as chemists. Some people who promote creationist or religiously inspired accounts of the origins of life make plausible sounding arguments and appear to be following the rules of logical inference. These kinds of surface similarities may make it impossible for nonspecialists to determine which fields are scientific and which fields are not. Indeed, even if everyone agreed that astronomy is science but astrology is not, it would be important from a philosophical perspective to justify this determination. Popper calls this the Problem of Demarcation: how to separate scientific theories from nonscientific ones.

Popper answered this question by proposing the principle of falsifiability. He required that, in order for a theory to be scientific, it must make a prediction with enough confidence that, if the prediction disagreed with the actual outcome of an appropriate experiment or observation, the theory would be discarded. In other words, a scientist proposing a new theory must be willing to risk embarrassment if it turns out the theory does not agree with reality. This rule prevents people from constructing grandiose theories that have no empirical consequences. It also prevents people from using a theory as a lens that distorts all observations so as to render them compatible with its abstractions. If Aristotle had been aware of the idea of falsifiability, he might have avoided developing his silly theory of the four elements, by realizing that it made no concrete predictions.

In terms of the notation developed above, the falsifiability principle requires that a theory can be instantiated as a function f that applies to some real world configurations. Furthermore, the theory must designate a configuration x and a prediction f(x) with enough confidence that if the experiment is done, and the resulting value y does not agree with the prediction f(x), then the theory is discarded. This condition is fairly weak, since it requires a prediction for only a single configuration. The point is that the falsifiability principle does not say anything about the value of a theory; it only states a requirement for the theory to be considered scientific. It is a sort of precondition that guarantees that the theory can be evaluated in relation to other theories. It is very possible for a theory to be scientific but wrong.
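Popper's condition can be sketched as a simple check; the theory, the designated configuration, and the experiment below are all hypothetical stand-ins:

```python
def survives_falsification(theory, designated_x, run_experiment):
    """Popper's condition as code: a scientific theory must commit to
    a prediction theory(x) for some designated configuration x.  If
    the observed outcome disagrees, the theory is discarded (returns
    False).  run_experiment stands in for the laboratory."""
    prediction = theory(designated_x)
    outcome = run_experiment(designated_x)
    return prediction == outcome

# "All swans are white" commits to a prediction about the next swan.
all_swans_white = lambda swan: "white"
assert survives_falsification(all_swans_white, "european swan",
                              lambda s: "white")
assert not survives_falsification(all_swans_white, "australian swan",
                                  lambda s: "black")
```

A theory that cannot be written in this form, because it designates no configuration and commits to no prediction, fails the precondition: it is not scientific, whether or not it is true.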

In addition to marking a boundary between science and pseudoscience, the falsifiability principle also permits one to delineate between statements of mathematics and empirical science. Mathematical statements are not falsifiable in the same way empirical statements are. Mathematicians do not and can not use the falsifiability principle; their results are verified using an alternate criterion: the mathematical proof. No new empirical observation or experiment could falsify the Pythagorean theorem. A person who drew a right triangle and attempted to show that the lengths of its sides did not satisfy a^2 + b^2 = c^2 would just be ridiculed. Mathematical statements are fundamentally implications: if the axioms are satisfied, then the conclusion follows logically.

The falsifiability principle is strong medicine, and comes, as it were, with a set of powerful side-effects. Most prominently, the principle allows one to conclude that a theory is false, but provides no mechanism whatever to justify the conclusion that a theory is true. This fact is rooted in one of the most basic rules of logical inference: it is impossible to assert universal conclusions on the basis of existential premises. Consider the theory “all swans are white”. The sighting of a black swan, and the resulting premise “some swans are black”, leads one to conclude that the theory is false. But no matter how many white swans one may happen to observe, one cannot conclude with perfect confidence that the theory is true. According to Popper, the only way to establish a scientific theory is to falsify all of its competitors. But because the number of competitors is vast, they cannot all be disqualified. This promotes a stance of radical skepticism towards scientific knowledge.

1.1.6 Science as a Search Through Theory-Space

Though the scientific method is not monolithic or precisely defined, the following list describes it fairly well:

  1. Through observation and experiment, amass an initial corpus of configuration-outcome pairs relating to some phenomenon of interest.

  2. Let be the initial champion theory.

  3. Through observation and analysis, develop a new theory, which may either be a refinement of the champion theory, or something completely new. Prefer simpler candidate theories to more complex ones.

  4. Instantiate the new theory in a predictive function f_r. If this cannot be done, the theory is not scientific.

  5. Find a configuration x for which f_c(x) ≠ f_r(x), and run the indicated experiment.

  6. If the outcome y agrees with the rival theory, f_r(x) = y, then discard the old champion and set T_c = T_r. Otherwise discard T_r.

  7. Return to step #3.
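The loop above can be sketched in code. The following is a toy illustration, not anything from the text: the "theories" are hypothetical predictive functions, the configurations are just numbers, and the experiment is an oracle we query. A champion is unseated when a rival agrees with experiment on a configuration where the two theories disagree.

```python
# Minimal sketch of the theory-space search loop (toy example; all
# theories, configurations, and the experiment oracle are hypothetical).
def run_search(champion, candidates, experiment, configurations):
    """Keep whichever theory agrees with experiment where they disagree."""
    for rival in candidates:
        # Step 5: find a configuration on which the two theories disagree.
        x = next((c for c in configurations
                  if champion(c) != rival(c)), None)
        if x is None:
            continue  # the theories are empirically indistinguishable
        # Step 6: run the experiment and keep the agreeing theory.
        if experiment(x) == rival(x):
            champion = rival
    return champion

# Toy usage: the "true" law is y = 2x; the initial champion says y = x.
truth = lambda x: 2 * x
champion = run_search(lambda x: x,
                      [lambda x: 2 * x, lambda x: 3 * x],
                      truth, range(1, 10))
```

The first rival (y = 2x) unseats the initial champion because it matches the experiment; the second rival (y = 3x) fails the test and is discarded.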

The scientific process described above makes a crucial assumption, which is that perfect agreement between theory and experiment can be observed, such that f_c(x) = y exactly. In practice, scientists never observe f_c(x) = y but rather f_c(x) ≈ y. This fact does not break the process described above, because even if neither theory is perfectly correct, it is reasonable to assess one theory as “more correct” than another and thereby discard the less correct one. However, the fact that real experiments never agree perfectly with theoretical predictions has important philosophical consequences, because it means that scientists are searching not for perfect truth but for good approximations. Most physicists will admit that even their most refined theories are mere approximations, though they are spectacularly accurate approximations.

In the light of this idea about approximation, the following conception of science becomes possible. Science is a search through a vast space that contains all possible theories. There is some ideal theory T*, which correctly predicts the outcome of all experimental configurations. However, this ideal theory can never be obtained. Instead, scientists proceed towards T* through a process of iterative refinement. At every moment, the current champion theory T_c is the best known approximation to T*. And each time a champion theory is unseated in favor of a new candidate, the new T_c is a bit closer to T*.

This view of science as a search for good approximations brings up another nonobvious component of the philosophical foundations of empirical science. If perfect truth cannot be obtained, why is it worth expending so much effort to obtain mere approximations? Wouldn’t one expect that using an approximation might cause problems at crucial moments? If the theory that explains an airplane’s ability to remain aloft is only an approximation, why is anyone willing to board an airplane? The answer is, of course, that the approximation is good enough. The fact that perfection is unachievable does not and should not dissuade scientists from reaching toward it. A serious runner considers it deeply meaningful to attempt to run faster, though it is impossible for him to complete a mile in less than a minute. In the same way, scientists consider it worthwhile to search for increasingly accurate approximations, though perfect truth is unreachable.

1.1.7 Circularity Commitment and Reusability Hypothesis

Empirical scientists follow a unique conceptual cycle in their work that begins and ends in the same place. Mathematicians start from axioms and move on to theorems. Engineers start from basic components and assemble them into more sophisticated devices. An empirical scientist begins with an experiment or set of observations that produce measurements. She then contemplates the data and attempts to understand the hidden structure of the measurements. If she is smart and lucky, she might discover a theory of the phenomenon. To test the theory, she uses it to make predictions regarding the original phenomenon. In other words, the same phenomenon acts as both the starting point and the ultimate justification for a theory. This dedication to the single, isolated goal of describing a particular phenomenon is called the Circularity Commitment.

The nonobviousness of the Circularity Commitment can be understood by considering the alternative. Imagine a scientific community in which theories are not justified by their ability to make empirical predictions, but by their practical utility. For example, a candidate theory of thermodynamics might be evaluated based on whether it can be used to construct combustion engines. If the engine works, the theory must be good. This reasoning is actually quite plausible, but science does not work this way. No serious scientist would suggest that because the theory of relativity is not relevant to or useful for the construction of airplanes, it is not an important or worthwhile theory. Modern physicists develop theories regarding a wide range of esoteric topics such as quantum superfluidity and the entropy of black holes without concerning themselves with the practicality of those theories. Empirical scientists are thus very similar to mathematicians in the purist attitude they adopt regarding their work.

In a prescientific age, a researcher expressing this kind of dedication to pure empirical inquiry, especially given the effort required to carry out such an inquiry, might be viewed as an eccentric crank or religious zealot. In modern times no such stigma exists, because everyone can see that empirical science is eminently practical. This leads to another deeply surprising idea, here called the Reusability Hypothesis: in spite of the fact that scientists are explicitly unconcerned with the utility of their theories, it just so happens that those theories tend to be extraordinarily useful. Of course, no one can know in advance which areas of empirical inquiry will prove to be technologically relevant. But the history of science demonstrates that new empirical theories often catalyze the development of amazing new technologies. Thus Maxwell’s unified theory of electrodynamics led to a wide array of electronic devices, and Einstein’s theory of relativity led to the atomic bomb. The fact that large sums of public money are spent on constructing ever-larger particle colliders is evidence that the Reusability Hypothesis is well understood even by government officials and policy makers.

The Circularity Commitment and the Reusability Hypothesis complement each other naturally. Society would never be willing to fund scientific research if it did not produce some tangible benefits. But if society explicitly required scientists to produce practical results, the scope of scientific investigation would be drastically reduced. Einstein would not have been able to justify his research into relativity, since that theory had few obvious applications at the time it was invented. The two philosophical ideas justify a fruitful division of labor. Scientists aim with intent concentration at a single target: the development of good empirical theories. They can then hand off their theories to the engineers, who often find the theories to be useful in the development of new technologies.

1.2 Sophie’s Method

This section develops a refined version of the scientific method, in which large databases are used instead of experimental observations as the necessary empirical ingredient. The necessary modifications are fairly minor, so the revised version includes all of the same conceptual apparatus of the standard version. At the same time, the modification is significant enough to considerably expand the scope of empirical science. The refined version is developed through a series of thought experiments relating to a fictional character named Sophie.

1.2.1 The Shaman

Sophie is an assistant professor of physics at a large American state university. She finds this job vexing for several reasons, one of which is that she has been chosen by the department to teach a physics class intended for students majoring in the humanities, for whom it serves to fill a breadth requirement. The students in this class, who major in subjects like literature, religious studies, and philosophy, tend to be intelligent but also querulous and somewhat disdainful of the “merely technical” intellectual achievements of physics.

In the current semester she has become aware of the presence in her class of a discalced student with a large beard and often bloodshot eyes. This student is surrounded by an entourage of similarly strange looking followers. Sophie is on good terms with some of the more serious students in the class, and in conversation with them has found out that the odd student is attempting to start a new naturalistic religious movement and refers to himself as a “shaman”.

One day while delivering a simple lecture on Newtonian mechanics, Sophie is surprised when the shaman raises his hand. When Sophie calls on him, he proceeds to claim that physics is a propagandistic hoax designed by the elites as a way to control the population. Sophie blinks several times, and then responds that physics can’t be a hoax because it makes real-world predictions that can be verified by independent observers. The shaman counters by claiming that the so-called “predictions” made by physics are in fact trivialities, and that he can obtain better forecasts by communing with the spirit world. He then proceeds to challenge Sophie to a predictive duel, in which the two of them will make forecasts regarding the outcome of a simple experiment, the winner being decided based on the accuracy of the forecasts. Sophie is taken aback by this but, hoping that by proving the shaman wrong she can break the spell he has cast on some of the other students, agrees to the challenge.

During the next class, Sophie sets up the following experiment. She uses a spring mechanism to launch a ball into the air at an angle θ. The launch mechanism allows her to set the initial velocity of the ball to a chosen value v. She chooses as a predictive test the problem of predicting the time t at which the ball will fall back to the ground after being launched. Using a trivial Newtonian calculation she concludes that t = 2v sin(θ)/g, plugs in her chosen values of θ and v to obtain a predicted time, and announces her prediction to the class. She then asks the shaman for his prediction. The shaman declares that he must consult with the wind spirits, and then spends a couple of minutes chanting and muttering. Then, dramatically flaring open his eyes as if to signify a moment of revelation, he grabs a piece of paper, writes his prediction on it, and hands it to another student. Sophie suspects some kind of trick, but is too exasperated to investigate and so launches the ball into the air. The ball is equipped with an electronic timer that starts and stops when an impact is detected, so the number registered in the timer is just the time of flight t. A student picks up the ball and reads off the measured result. The shaman gives a gleeful laugh, and the student holding his written prediction hands it to Sophie. On the paper is written a very wide interval of times. The shaman declares victory: the measured time falls inside his interval, so his prediction turned out to be correct, while Sophie’s was incorrect (it differed slightly from the measured time).

To counter the shaman’s claim, and because it was on the syllabus anyway, in the next class Sophie begins a discussion of probability theory. She goes over the basic ideas, and then connects them to the experimental prediction made about the ball. She points out that technically, the Newtonian prediction

t = 2v sin(θ)/g

is not an assertion about the exact value of the outcome. Rather it should be interpreted as the mean of a probability distribution describing possible outcomes. For example, one might use a normal distribution with mean

μ = 2v sin(θ)/g

and some small standard deviation σ. The reason the shaman superficially seemed to win the contest is that he gave a probability distribution while Sophie gave a point prediction; these two types of forecast are not really comparable. In the light of probability theory, the reason to prefer the Newtonian prediction over the shamanic one is that it assigns a higher probability to the outcome that actually occurred. Now, plausibly, if only a single trial is used then the Newtonian theory might simply have gotten lucky, so the reasonable thing to do is to combine the results over many trials, by multiplying the probabilities together. Therefore, the formal justification for preferring the Newtonian theory to the shamanic theory is that:

∏ᵢ P_newton(tᵢ) > ∏ᵢ P_shaman(tᵢ)

where the index i runs over the many trials of the experiment. Sophie then shows that the Newtonian probability predictions are both more confident and more correct than the shamanic predictions. The Newtonian predictions assign a very large amount of probability to the region around the predicted outcome, and in fact it turns out that almost all of the real data outcomes fall in this range. In contrast, the shamanic prediction assigns a relatively small amount of probability to the region, because it covers a very wide interval. Thus while the shamanic prediction is correct, it is not very confident. The Newtonian prediction is correct and highly confident, and so it should be preferred.

Sophie tries to emphasize that the Newtonian probability prediction works well only for the real data. Because of the requirement that probability distributions be normalized, the Newtonian theory can only achieve good results by reassigning probability towards the region around the predicted time and away from other regions. A theory that does not perform this kind of reassignment cannot achieve superior performance.

Sophie recalls that some of the students are studying computer science and for their benefit points out the following. The famous Shannon equation, L(x) = −log₂ P(x), governs the relationship between the probability of an outcome and the length of the optimal code that should be used to represent it. Therefore, given a large data file containing the results of many trials of the ballistic motion experiment, the two predictions (Newtonian and shamanic) can both be used to build specialized programs to compress the data file. Using the Shannon equation, the above inequality can be rewritten as follows:

∑ᵢ L_newton(tᵢ) < ∑ᵢ L_shaman(tᵢ)
This inequality indicates an alternative criterion that can be used to decide between two rival theories. Given a data file recording measurements related to a phenomenon of interest, a scientific theory can be used to write a compression program that will shrink the file to a small size. To decide between two rival theories of the same phenomenon, one invokes the corresponding compressors on a shared benchmark data set, and prefers the theory that achieves a smaller encoded file size. This criterion is equivalent to the probability-based one, but has the advantage of being more concrete, since the quantities of interest are file lengths instead of probabilities.
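The equivalence between the probability criterion and the codelength criterion can be sketched numerically. All numbers below are illustrative, not from the text: a sharp “Newtonian” normal distribution is compared against a flat “shamanic” interval, on outcomes clustered near the Newtonian mean. The theory that assigns higher probability to the data is exactly the theory that yields the shorter total Shannon code.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical forecasts: a tight normal vs. a very wide uniform interval.
mu, sigma = 1.90, 0.02          # "Newtonian" mean and spread (seconds)
lo, hi = 0.0, 10.0              # "shamanic" interval (seconds)
outcomes = [1.89, 1.91, 1.90, 1.92, 1.88]   # invented measured times

# Density of each forecast at the observed outcomes (densities stand in
# for the probability of a small measurement bin).
p_newton = [normal_pdf(t, mu, sigma) for t in outcomes]
p_shaman = [1.0 / (hi - lo)] * len(outcomes)

# Shannon: codelength = -log2(probability), so higher probability on the
# data is the same thing as a shorter total code.
bits_newton = sum(-math.log2(p) for p in p_newton)
bits_shaman = sum(-math.log2(p) for p in p_shaman)
assert bits_newton < bits_shaman
```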

1.2.2 The Dead Experimentalist

Sophie is a theoretical physicist and, upon taking up her position as assistant professor, began a collaboration with a brilliant experimental physicist who had been working at the university for some time. The experimentalist had previously completed the development of an advanced apparatus that allowed the investigation of an exotic new kind of quantum phenomenon. Using data obtained from the new system, Sophie made rapid progress in developing a mathematical theory of the phenomenon. Tragically, just before Sophie was able to complete her theory, the experimentalist was killed in a laboratory explosion that also destroyed the special apparatus. After grieving for a while, Sophie decided that the best way to honor her friend’s memory would be to bring the research they had been working on to a successful conclusion.

Unfortunately, there is a critical problem with Sophie’s plan. The experimental apparatus had been completely destroyed, and Sophie’s late partner was the only person in the world who could have rebuilt it. He had run many trials of the system before his death, so Sophie had a quite large quantity of data. But she had no way of generating any new data. Thus, no matter how beautiful and perfect her theory might be, she had no way of testing it by making predictions.

One day while thinking about the problem Sophie recalls the incident with the shaman. She remembers the point she had made for the benefit of the software engineers, about how a scientific theory could be used to compress a real world data set to a very small size. Inspired, she decides to apply the data compression principle as a way of testing her theory. She immediately returns to her office and spends the next several weeks writing Matlab code, converting her theory into a compression algorithm. The resulting compressor is successful: it shrinks the corpus of experimental data to an encoded file that is a small fraction of its original size. Satisfied, Sophie writes up the theory, and submits it to a well-known physics journal.

The journal editors like the theory, but are a bit skeptical of the compression based method for testing the theory. Sophie argues that if the theory becomes widely known, one of the other experts in the field will develop a similar apparatus, which can then be used to test the theory in the traditional way. She also offers to release the experimental data, so that other researchers can test their own theories using the same compression principle. Finally she promises to release the source code of her program, to allow external verification of the compression result. These arguments finally convince the journal editors to accept the paper.

1.2.3 The Rival Theory

After all the mathematics, software development, prose revisions, and persuasion necessary to complete her theory and have the paper accepted, Sophie decides to reward herself by living the good life for a while. She is confident that her theory is essentially correct, and will eventually be recognized as correct by her colleagues. So she spends her time reading novels and hanging out in coffee shops with her friends.

A couple of months later, however, she receives an unpleasant shock in the form of an email from a colleague which is phrased in consolatory language, but does not contain any clue as to why such language might be in order. After some investigation she finds out that a new paper has been published about the same quantum phenomenon of interest to Sophie. The paper proposes an alternative theory of the phenomenon which bears no resemblance whatever to Sophie’s. Furthermore, the paper reports a better compression rate than was achieved by Sophie, on the database that she released.

Sophie reads the new paper and quickly realizes that it is worthless. The theory depends on the introduction of a large number of additional parameters, the values of which must be obtained from the data itself. In fact, a substantial portion of the paper involves a description of a statistical algorithm that estimates optimal parameter values from the data. In spite of these aesthetic flaws, she finds that many of her colleagues are quite taken with the new paper and some consider it to be the “next big thing”. Sophie sends a message to the journal editors describing in detail what she sees as the many flaws of the upstart paper. The editors express sympathy, but point out that the new theory outperforms Sophie’s theory using the performance metric she herself proposed. The beauty of a theory is important, but its correctness is ultimately more important.

Somewhat discouraged, Sophie sends a polite email to the authors of the new paper, congratulating them on their result and asking to see their source code. Their response, which arrives a week later, contains a vague excuse about how the source code is not properly documented and relies on proprietary third party libraries. Annoyed, Sophie contacts the journal editors again and asks them for the program they used to verify the compression result. They reply with a link to a binary version of the program.

When Sophie clicks on the link to download the program, she is annoyed to find it has a size of 800 megabytes. But her annoyance is quickly transformed into enlightenment, as she realizes what happened, and that her previous philosophy contained a serious flaw. The upstart theory is not better than hers; it has only succeeded in reducing the size of the encoded data by dramatically increasing the size of the compressor. Indeed, when dealing with specialized compressors, the distinction between “program” and “encoded data” becomes almost irrelevant. The critical number is not the size of the compressed file, but the net size of the encoded data plus the compressor itself.

Sophie writes a response to the new paper which describes the refined compression rate principle. She begins the paper by reiterating the unfortunate circumstances which forced her to appeal to the principle, and expressing the hope that someday an experimental group will rebuild the apparatus developed by her late partner, so that the experimental predictions made by the two theories can be properly tested. Until that day arrives, standard scientific practice does not permit a decisive declaration of theoretical success. But surely there is some theoretical statement that can be made in the meantime, given the large quantity of data that is available. Sophie’s proposal is that the goal should be to find the theory that has the highest probability of predicting a new data set, when it can finally be obtained. If the theories are very simple in comparison to the data being modeled, then the size of the encoded data file is a good way of choosing the best theory. But if the theories are complex, then there is a risk of overfitting the data. To guard against overfitting, complex theories must be penalized; a simple way to do this is to take into account the codelength required for the compressor itself. The length of Sophie’s compressor was negligible, so the net score of her theory is essentially just the codelength of the encoded data file. The rival theory achieved a smaller encoded data file, but required an enormous compressor to do so, giving a much larger total score. Since Sophie’s net score is lower, her theory should be preferred.
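The refined scoring rule can be stated in a few lines of code. The sizes below are hypothetical, except for the rival’s 800-megabyte compressor, which comes from the story above: the score of a theory is the length of the encoded file plus the length of the compressor.

```python
# Refined compression-rate score (all sizes hypothetical except the
# rival's 800 MB compressor, taken from the story): total codelength is
# the encoded file length PLUS the length of the compressor itself.
def net_score(compressor_bits, encoded_bits):
    return compressor_bits + encoded_bits

MB = 8 * 1024 * 1024                      # bits per megabyte
sophie = net_score(compressor_bits=1 * MB, encoded_bits=1000 * MB)
rival = net_score(compressor_bits=800 * MB, encoded_bits=900 * MB)

# The rival's smaller encoded file is bought with a huge compressor,
# so Sophie's theory wins on the net score.
assert sophie < rival
```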

1.3 Compression Rate Method

In the course of the thought experiments discussed above, the protagonist Sophie articulated a refined version of the scientific method. This procedure will be called the Compression Rate Method (CRM). The web of concepts related to the CRM will be called the comperical philosophy of science, for reasons that will become evident in the next section. The CRM consists of the following steps:

  1. Obtain a vast database relating to a phenomenon of interest.

  2. Let T_c be the initial champion theory.

  3. Through observation and analysis, develop a new theory T_r, which may be either a simple refinement of T_c or something radically new.

  4. Instantiate T_r as a compression program. If this cannot be done, then the theory is not scientific.

  5. Score the theory by calculating L(T_r), the sum of the length of the database as encoded by T_r and the length of the compressor itself.

  6. If L(T_r) < L(T_c), then discard the old champion and set T_c = T_r. Otherwise discard T_r.

  7. Return to step #3.
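The CRM loop can be sketched with off-the-shelf parts. This is a toy illustration, not from the text: each “theory” is a compressor function (zlib stands in for a real scientific model), and its source-code length is an assumed constant rather than a measured one.

```python
import zlib

# Toy instantiation of the CRM loop (illustrative only). Each "theory"
# is a compressor paired with an assumed length, in bytes, for its own
# source code.
database = b"the cat sat on the mat. " * 4000

theories = {
    "null": (lambda d: d, 20),                           # stores data verbatim
    "repetition": (lambda d: zlib.compress(d, 9), 200),  # exploits structure
}

def score(theory, data):
    """L(T): length of the encoded database plus length of the compressor."""
    compress, compressor_len = theory
    return compressor_len + len(compress(data))

# Steps 3-6: keep whichever theory achieves the smaller total codelength.
champion = min(theories, key=lambda name: score(theories[name], database))
```

Because the database is highly repetitive, the zlib-based “theory” wins despite its larger compressor; on incompressible data the null theory would retain the championship.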

It is worthwhile to compare the CRM to the version of the scientific method given in Section 1.1.6. One improvement is that in this version the Occam’s Razor principle plays an explicit role, through the compressor-length term in the score L(T). A solution to the Problem of Demarcation is also built into the process in Step #4. The main difference is that the empirical ingredient in the CRM is a large database, while the traditional method employs experimental observation.

The significance of the CRM can be seen by understanding the relationship between the target database D and the resulting theories. If D contains data related to the outcomes of physical experiments, then physical theories will be necessary to compress it. If D contains information related to interest rates, house prices, global trade flows, and so on, then economic theories will be necessary to compress it. One obvious choice for D is simply an enormous image database, such as the one hosted by the Facebook social networking site. In order to compress such a database one must develop theories of visual reality. The idea that there can be an empirical science of visual reality has never before been articulated, and is one of the central ideas of this book. A key argument, contained in Chapter 3, is that the research resulting from the application of the CRM to a large database of natural images will produce a field very similar to modern computer vision. Similarly, Chapter 4 argues that the application of the CRM to a large text corpus will result in a field very similar to computational linguistics. Furthermore, the reformulated versions of these fields will have far stronger philosophical foundations, due to the explicit connection between the CRM and the traditional scientific method.

It is crucial to emphasize the deep connection between compression and prediction. The real goal of the CRM is to evaluate the predictive power of a theory, and the compression rate is just a way of quantifying that power. There are three advantages to using the compression rate instead of some measure of predictive accuracy. First, the compression rate naturally accommodates a model complexity penalty term. Second, the compression rate of a large database is an objective quantity, due to the ideas of Kolmogorov complexity and universal computation, discussed below. Third, the compression principle provides an important verificational benefit. To verify a claim made by an advocate of a new theory, a referee only needs to check the encoded file size, and ensure that the resulting decoded data matches exactly the original database D.

Most people express skepticism as their first reaction to the plan of research embodied by the CRM. They generally admit that it may be possible to use the method to obtain increasingly short codes for the target databases. But they balk at accepting the idea that the method will produce anything else of value. The following sections argue that the philosophical commitments implied by the CRM are exactly analogous to those long accepted by scientists working in mainstream fields of empirical science. Comperical science is nonobvious in the year 2011 for exactly the same kinds of reasons that empirical science was nonobvious in the year 1511.

Figure 1.1: Histograms of differences between values of neighboring pixels in a natural image (left) and a random image (right). The clustering of the pixel difference values around 0 in the natural image is what allows compression formats like PNG to achieve compression. Note the larger vertical scale of the histogram on the left; both histograms represent the same number of pixels.

1.3.1 Data Compression is Empirical Science

The following theorem is well known in data compression. Let C be a program that losslessly compresses N-bit strings x, assigning each string a new code with length L_C(x). Let Q be the uniform distribution over N-bit strings. Then the following bound holds for all compression programs C:

E_Q[L_C(x)] ≥ N

In words, the theorem states that no lossless compression program can achieve average codelengths smaller than N bits, when averaged over all possible N-bit input strings. Below, this statement is referred to as the “No Free Lunch” (NFL) theorem of data compression, as it implies that one can achieve compression for some strings only at the price of inflating other strings. At first glance, this theorem appears to turn the CRM proposal into nonsense. In fact, the theorem is the keystone of the comperical philosophy because it shows how lossless, large-scale compression research must be essentially empirical in character.
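The theorem ultimately rests on a counting argument that is easy to check directly: there are simply not enough short strings to go around.

```python
# Pigeonhole argument behind the NFL theorem: there are 2**N distinct
# N-bit strings, but only 2**N - 1 distinct strings (counting the empty
# string) of length strictly less than N. So no lossless (injective)
# code can map every N-bit input to a shorter output.
N = 16
n_inputs = 2 ** N
n_shorter_outputs = sum(2 ** k for k in range(N))  # lengths 0 .. N-1
assert n_shorter_outputs == 2 ** N - 1
assert n_shorter_outputs < n_inputs
```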

To see this point, consider the following apparent paradox. In spite of the NFL theorem, lossless image compression programs exist and have been in widespread use for years. As an example, the well-known Portable Network Graphics (PNG) compression algorithm seems to reliably produce encoded files that are 40-50% shorter than would be achieved by a uniform encoding. This apparent success seems to violate the No Free Lunch theorem.

The paradox is resolved by noticing that the images used to evaluate image compression algorithms are not drawn from a uniform distribution over images. If lossless image formats were evaluated based on their ability to compress random images, no such format could ever be judged successful. Instead, the images used in the evaluation process belong to a very special subset of all possible images: those that arise as a result of everyday human photography. This “real world” image subset, though vast in absolute terms, is minuscule compared to the space of all possible images. So PNG is able to compress a certain image subset, while inflating all other images. And the subset that PNG is able to compress happens to overlap substantially with the real world image subset.

The specific empirical regularity used by the PNG format is that in real world images, adjacent pixel values tend to have very similar values. A compressor can exploit this property by encoding the differences between neighboring pixel values instead of the values themselves. The distribution of differences is very narrowly clustered around zero, so they can be encoded using shorter average codes (see Figure 1.1). Of course, this trick does not work for random images, in which there is no correlation between adjacent pixels.
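The delta-filtering trick can be illustrated with toy scanlines (invented data, not PNG’s actual filter implementation): the empirical entropy of neighbor differences is far lower than that of the raw values for smooth data, and not for random data.

```python
import math
import random
from collections import Counter

def entropy_bits_per_symbol(values):
    """Empirical Shannon entropy: a lower bound on bits per symbol for
    any code built from the observed symbol frequencies."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

random.seed(0)
# Toy "natural" scanline: pixel values drift slowly, as in a photograph.
natural = [128]
for _ in range(9999):
    natural.append(max(0, min(255, natural[-1] + random.randint(-2, 2))))
# Random scanline: adjacent pixels are uncorrelated.
noise = [random.randrange(256) for _ in range(10000)]

def deltas(pixels):
    """Differences between neighboring pixel values."""
    return [b - a for a, b in zip(pixels, pixels[1:])]

# Delta filtering pays off only where neighbors are correlated: it
# shrinks the code for the smooth data and inflates it for the noise.
assert entropy_bits_per_symbol(deltas(natural)) < entropy_bits_per_symbol(natural)
assert entropy_bits_per_symbol(deltas(noise)) > entropy_bits_per_symbol(noise)
```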

The NFL theorem indicates that in order to succeed, a comperical researcher must follow a strategy analogous to the procedure of physics. First, she must attempt to discover some structures or patterns present in real world images. Then she must develop a mathematical theory characterizing that structure, and build the theory into a compressor. Finally, she must demonstrate that the theory corresponds to reality, by showing that it achieves an improved compression rate.

To make statements about the world, physicists need to combine mathematical and empirical reasoning; neither alone is sufficient. Consider the following statement of physics: when a ball is tossed into the air, its vertical position will be described by the equation y(t) = y₀ + v₀t − (1/2)gt². That statement can be decomposed into a mathematical and an empirical component. The mathematical statement is: if a quantity’s evolution in time is governed by the differential equation y″ = −k, where k is some constant, then its value is given by the function y(t) = y₀ + v₀t − (1/2)kt², where y₀ and v₀ are determined by the initial conditions. The empirical statement is: if a ball is thrown into the air, its vertical position will be governed by the differential equation y″ = −g, where g is the acceleration due to gravity. By combining these statements together, the physicist is able to make a variety of predictions.
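The mathematical component of the decomposition can be checked numerically. The sketch below uses the standard value g = 9.8 m/s² and arbitrary initial conditions; it verifies that the quadratic trajectory satisfies the differential equation and returns the ball to the ground at the predicted time.

```python
# Numerical check of the mathematical statement: the trajectory
# y(t) = y0 + v0*t - 0.5*g*t**2 satisfies y'' = -g, and for a vertical
# launch from the ground the ball lands at t = 2*v0/g.
g, y0, v0 = 9.8, 0.0, 12.0      # standard g; arbitrary initial conditions

def y(t):
    return y0 + v0 * t - 0.5 * g * t ** 2

# Second derivative via central finite differences (exact for a quadratic,
# up to floating-point rounding).
h = 1e-4
t = 0.7
y_ddot = (y(t + h) - 2 * y(t) + y(t - h)) / h ** 2
assert abs(y_ddot + g) < 1e-3

# The ball returns to the ground at the predicted landing time.
assert abs(y(2 * v0 / g)) < 1e-9
```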

Just like physicists, comperical researchers must combine mathematical statements with empirical statements in order to make predictions. Because of the NFL theorem, pure mathematics is never sufficient to reach conclusions of the form: “Algorithm Y achieves good compression.” Mathematical reasoning can only be used to make implications: “If the images exhibit property X, then algorithm Y will achieve good compression”. In order to actually achieve compression, it is necessary to demonstrate the empirical fact that the images actually have property X. This shows why the comperical proposal is not fundamentally about saving disk space or bandwidth; it is fundamentally about characterizing the properties of images or other types of data.

1.3.2 Comparison to Popperian Philosophy

The comperical philosophy of science bears a strong family resemblance to the Popperian one, and inherits many of its conceptual advantages. First, the compression principle provides a clear answer to the Problem of Demarcation: a theory is scientific if and only if it can be used to build a compressor for an appropriate kind of database. Because of the intrinsic difficulty of lossless data compression, the only way to save bits is to explicitly reassign probability away from some outcomes and toward other outcomes. If the theory assigns very low probability to an outcome which then occurs, this suggests that the theory has low quality and should be discarded. Thus, the probability reassignment requirement is just a graduated or continuous version of the falsification requirement. The falsifiability principle means that a researcher hoping to prove the value of his new theory must risk embarrassment if his predictions turn out to be incorrect. The compression principle requires a researcher to face the potential for embarrassment if his new theory ends up inflating the database.

One difference between the Popperian view and the comperical view is that the former appears to justify stark binary assessments regarding the truth or falsehood of a theory, while the latter provides only a number which can be compared to other numbers. If theories are either true or false, then the compression principle is no more useful than the falsifiability principle. But if theories can exist on some middle ground between absolute truth and its opposite, then it makes sense to claim that one theory is relatively more true than another, even if both are imperfect. The compression principle can be used to justify such claims. Falsifiability consigns all imperfect theories to the same garbage bin; compression can be used to rescue the valuable theories from the bin, dust them off, and establish them as legitimate science.

The falsifiability idea seems to imply that theories can be evaluated in isolation: a theory is either true or false, and this assessment does not depend on the content of rival theories. In contrast, while the compression idea assigns a score to an individual theory, this score is useful only for the purpose of comparison. This distinction may be conceptually significant to some people, but in practice it is unimportant. Science is a search for good approximations; science proceeds by incrementally improving the quality of the approximations. The power of the falsifiability requirement is that it enables a rapid search through the theory-space by ensuring that theories can be decisively compared. The compression requirement provides exactly the same benefit. When a researcher proposes a new theory and shows that it can achieve a smaller compressed file size for the target database, this provides decisive evidence that the new theory is superior. Furthermore, both principles allow a research community to identify a champion theory. In the Popperian view, the champion theory is the one that has withstood all attempts at falsification. In the comperical view, the champion theory is the one that achieves the smallest codelength on the relevant benchmark database.

One of the core elements of Popper’s philosophy is the dedication to the continual testing, examination, and skepticism of scientific theories. A Popperian scientist is never content with the state of his knowledge. He never claims that a theory is true; he only accepts that there is currently no evidence that would falsify it. The comperical philosopher takes an entirely analogous stance. To her, a theory is never true or even optimal, it is only the best theory that has thus far been discovered. She will never claim, “the probability of event X is 35%”. Instead, she would state that “according to the current champion theory, the probability of event X is 35%”. She might even make decisions based on this probability assignment. But if a new theory arrives that provides a better codelength, she immediately replaces her probability estimates and updates her decision policy based on the new theory.

The Popperian commitment to continual examination and criticism of theoretical knowledge is good discipline, but the radical skepticism it promotes is probably a bit too extreme. A strict Popperian would be unwilling to use Newtonian physics once it was falsified, in spite of the fact that it obviously still works for most problems of practical interest. The compression principle promotes a more nuanced view. If a claim is made that a theory provides a good description of a certain phenomenon, and the claim is justified by demonstrating a strong compression result, then the claim is valid for all time. It is possible to develop a new theory that achieves a better compression rate, or to show that the previous theory does not do as well on another related database. These circumstances might suggest that the old theory should no longer be used. But if the old theory provided a good description of a particular database, no future developments will change that fact. This captures the intuition that Newtonian physics still provides a perfectly adequate description of a wide range of phenomena; Eddington’s solar eclipse photographs simply showed that there are some phenomena to which it does not apply.

1.3.3 Circularity and Reusability in the Context of Data Compression

Just like empirical scientists, comperical researchers adopt the Circularity Commitment to guide and focus their efforts. A comperical researcher evaluates a new theory based on one and only one criterion: its ability to compress the database for which it was developed. A community using a large collection of face images will be highly interested in tools such as hair models, eyeglass detectors, and theories of lip color, and only secondarily interested in potential applications of face modeling technology. If the researchers chose to introduce additional considerations into the theory comparison process, such as the relevance of a theory to a certain type of practical task, they would compromise their own ability to discard low quality theories and identify high quality ones.

Some truly purist thinkers may consider large scale data compression as an intrinsically interesting goal. Comperical researchers will face many challenging problems, involving mathematics, algorithm design, statistical inference, and knowledge representation. Furthermore, researchers will receive a clear signal indicating when they have made progress, and how much. For a certain type of intellectual, these considerations are very significant, even if there is no reason to believe that the investigation will yield any practical results.

In this light, it is worth comparing the proposed field of large scale lossless data compression with the established field of computer chess. Chess is an abstract symbolic game with very little connection to the real world. A computer chess advocate would find it quite difficult to convince a skeptical audience that constructing powerful chess programs would yield any tangible benefit. However, like the compression goal, the computer chess goal is attractive because it produces a variety of subproblems, and also provides a method for making decisive comparisons between rival solutions. For these reasons, computer scientists devoted a significant amount of effort to the field, leading some to claim that chess was "the Drosophila of AI research". Furthermore, these efforts were incredibly successful, and led to the historic defeat of the top ranked human grandmaster, Garry Kasparov, by IBM's Deep Blue in 1997. Most scientists would agree that this event was an important advance for human knowledge, even if it did not lead to any practical applications. Because of its similar methodological advantages, comperical research has a similar potential to advance human knowledge.

For the reader who is unmoved by the argument about the intrinsic interest of compression science, it is essential to defend the validity of the Reusability Hypothesis in the context of data compression. The hypothesis really contains two separate pieces. First, theories employ abstractions, and good theories use abstractions that correspond to reality. So the abstraction called “mass” is not just a clever computational trick, but represents a fundamental aspect of reality. These real abstractions are useful both for compression and for practical applications. The second piece of the Reusability Hypothesis is that, while theories based on naïve or simplistic characterizations of reality can achieve compression, the best codelengths will be achieved by theories that use real abstractions. So by vigorously pursuing the compression goal, researchers can identify the real abstractions governing a particular phenomenon, and those abstractions can be reused for practical applications.

The following examples illustrate the idea of the Reusability Hypothesis. Consider constructing a target database by setting up a video camera next to a highway and recording the resulting image stream. One way to predict image frames (and thus compress the data) would be to identify batches of pixels corresponding to a car, and use an estimate of the car’s velocity to interpolate the pixels forward. A compressor that uses this trick thus implicitly contains abstractions related to the concepts of “car” and “velocity”. Since these are real abstractions, the Reusability Hypothesis states that the specialized compressor should achieve better compression rates than a more generic one. Another good example of this idea relates to text compression. Here, the Reusability Hypothesis states that a specialized compressor making use of abstractions such as verb conjugation patterns, parts of speech, and rules of grammar will perform better than a generic compressor. If the hypothesis is true, then the same division of labor between scientists and engineers that works for mainstream fields will work here as well. The comperical scientists obtain various abstractions by following the compression principle, and hand them off to the engineers, who will find them very useful for developing applications like automatic license plate readers and machine translation systems.

1.3.4 The Invisible Summit

An important concept related to the Compression Rate Method is called the Kolmogorov complexity. The Kolmogorov complexity $K_T(x)$ of a string $x$ is the length of the shortest program that will output $x$ when run on a Turing machine $T$. The key property of the Kolmogorov complexity comes about as a consequence of the idea of universal computation. If a Turing machine (roughly equivalent to a programming language) is of sufficient complexity, it becomes universal: it can simulate any other Turing machine, if given the right simulator program. So given a string $x$ and a short program $p$ that outputs it when run on Turing machine $A$, one can easily obtain a program $p'$ that outputs $x$ when run on (universal) Turing machine $B$, just by prepending a simulator program $s_{AB}$ to $p$, so that $p' = s_{AB} p$ and $|p'| = |s_{AB}| + |p|$. Now, the simulator program is fixed by the definition of the two Turing machines. Thus for very long and complex strings, the contribution of the simulator to the total program length becomes insignificant, so that $K_B(x) \approx K_A(x)$, and thus the Kolmogorov complexity $K(x)$ is effectively independent of the choice of Turing machine.
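This argument is conventionally stated as the invariance theorem; here $c_{AB}$ denotes the fixed length of the program that simulates machine $A$ on machine $B$:

```latex
K_B(x) \le K_A(x) + c_{AB}
  \qquad\text{and, symmetrically,}\qquad
K_A(x) \le K_B(x) + c_{BA}

\Longrightarrow\quad
\bigl| K_A(x) - K_B(x) \bigr| \le \max(c_{AB},\, c_{BA})
```

Since the bound on the right does not depend on $x$, it is negligible for long, complex strings.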

Unfortunately or not, a brief proof shows that the Kolmogorov complexity is incomputable: a program attempting to compute $K(x)$ cannot be guaranteed to terminate in finite time. This is not surprising, since if a method for computing the Kolmogorov complexity were found, it would be immensely powerful. Such a program would render theoretical physicists unnecessary. Experimental physicists could simply compile a large database of observations, and feed the database to the program. Since the optimal theory of physics provides the best explanation, and thus the shortest encoding, of the data, the program would automatically find the optimal theory of physics on its way to finding the Kolmogorov complexity.

Another way of seeing the impossibility of finding $K(x)$ is by imagining what it would mean to find the Kolmogorov complexity of the Facebook image database. To compress this database to the smallest possible size, one would have to know $P^*$: the probability distribution generating the Facebook images. While $P^*$ may look innocuous, in fact it is a mathematical object of vast complexity, containing an innumerable quantity of details. To begin with, it must contain a highly sophisticated model of the human face. It must contain knowledge of hair styles and facial expressions. It must capture the fact that lips are usually reddish in color, and that women are more likely to enhance this color using lipstick. Moving on from there, it would require knowledge about other things people like to photograph, such as pets, natural scenery, weddings, and boisterous parties. It would need to contain details about the appearance of babies, such as the fact that a baby usually has a pink face, and its head is large in proportion to the rest of its body. All this knowledge is necessary because, for example, $P^*$ must assign higher probability, and shorter codelength, to an image featuring a woman with red lips, than to an image that is identical in every way except that the woman has green lips.

While calculating $K(x)$ is impossible in general, one can find upper bounds to it. Indeed, the Compression Rate Method is just the process of finding a sequence of increasingly tight upper bounds on the Kolmogorov complexity of the target database. Each new champion theory corresponds to a tighter upper bound. In the case of images, a new champion theory corresponds to a new model $Q$ of the probability of an image. Every iteration of theory refinement packages more realistic information into the model $Q$, thereby bringing it closer to the unknowable true distribution. This process is exactly analogous to the search through the theory space carried out by empirical scientists. Both empirical scientists and comperical scientists recognize that their theories are mere approximations. The fact that perfect truth cannot be obtained simply does not matter: it is still worthwhile to climb towards the invisible summit.

1.3.5 Objective Statistics

Due to the direct relationship between statistical modeling and data compression (see Appendix A), comperical research can be regarded as a subfield of statistics. A traditional problem in statistics starts with a set of observations of some quantity, such as the physical height of a population. By analyzing the data set, the statistician attempts to obtain a good estimate $\hat{P}(h)$ of the probability of a given height $h$. This model could be, for example, a Gaussian distribution with a given mean and variance. Comperical research involves an entirely analogous process. The difference is that instead of simple single-dimensional numbers, comperical statisticians analyze complex data objects such as images or sentences, and attempt to find good models of the probability of such objects.

All statistical inference must face a deep conceptual issue that has been the subject of acrimonious debate and philosophical speculation since the time of David Hume, who first identified it. This is the Problem of Induction: when is it justified to jump from a limited set of specific observations (the data samples) to a universal rule describing the observations (the model)? This problem has divided statisticians into two camps, the Bayesians and the frequentists, who disagree fundamentally about the meaning and justification of statistical inference. A full analysis of the nature of this disagreement would require its own book, but a very rough summary is that, while the Bayesian approach has a number of conceptual benefits, it is hobbled by its dependence on the use of prior distributions. A Bayesian performs inference by using Bayes' rule to update a prior distribution in response to evidence, thus producing a posterior distribution, which can be used for decision-making and other purposes. The critical problem is that there is no objective way to choose a prior. Furthermore, two Bayesians who start with different priors will reach different conclusions, in spite of observing the same evidence. The use of Bayesian techniques to justify scientific conclusions therefore deprives science of objectivity.

Any data compressor must implement a mapping from data sets $x$ to bit strings of length $L(x)$. This mapping defines an implicit probability distribution $Q(x) = 2^{-L(x)}$. It appears, therefore, that comperical statisticians make the same commitment to the use of prior distributions as the Bayesians do. However, there is a crucial subtlety here. Because the length of the compressor itself is taken into account in the CRM, the prior distribution is actually defined by the choice of programming language used to write the compressor. Furthermore, comperical researchers use their models to describe vast datasets. Combined, these two facts imply that comperical statistical inference is objective. This idea is illustrated by the following thought experiment.
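The codelength-to-probability correspondence can be sketched in a few lines; the codebook below is a hypothetical prefix-free code, chosen only for illustration:

```python
def implicit_distribution(codelengths):
    """Map each outcome's codelength L(x) to its implicit probability 2^{-L(x)}."""
    return {x: 2.0 ** -L for x, L in codelengths.items()}

# Toy codebook: a prefix-free code over four outcomes (lengths are hypothetical).
codelengths = {"a": 1, "b": 2, "c": 3, "d": 3}
Q = implicit_distribution(codelengths)

# Kraft inequality: a complete prefix-free code implies sum(2^-L) = 1.
assert abs(sum(Q.values()) - 1.0) < 1e-12

# Shorter codewords correspond to higher implicit probability.
assert Q["a"] > Q["b"] > Q["c"] == Q["d"]
```

Assigning short codewords is therefore exactly the act of assigning high probability, which is why a compressor embodies a statistical model.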

Imagine a research subfield which has established a database as its target for CRM-style investigation. The subfield makes slow but steady progress for several years. Then, out of the blue, an unemployed autodidact from a rural village in India appears with a bold new theory. He claims that his theory, instantiated in a program $p$, achieves a compression rate which is dramatically superior to the current best published results. However, among his other eccentricities, this gentleman uses a programming language he himself developed, which corresponds to a Turing machine $T_I$. Now, the other researchers of the field are well-meaning but skeptical, since all the previously published results used a standard language corresponding to a Turing machine $T_S$. But it is easy for the Indian maverick to produce a compressor that will run on $T_S$: he simply prepends to $p$ a simulator program $s$ that simulates $T_I$ when run on $T_S$. The length of the new compressor is $|p| + |s|$, and all of the other researchers can confirm this. Now, assuming the data set is large and complex enough so that $|s| \ll |p|$, the codelength of the modified version is effectively the same as the original: $|p| + |s| \approx |p|$. This shows that there can be no fundamental disagreement among comperical researchers regarding the quality of a new result.

1.4 Example Inquiries

This section makes the abstract discussion above tangible by describing several concrete proposals. These proposals begin with a method of constructing a target database, which defines a line of inquiry. In principle, researchers can use any large database that is not completely random as a starting point for a comperical investigation. In practice, unless some care is exercised in the construction of the target dataset, it will be difficult to make progress. In the beginning stages of research, it will be more productive to look at data sources which display relatively limited amounts of variation. Here are some example inquiries that might provide good starting points:

  • Attempt to compress the immense image database hosted by the popular Facebook social networking web site. One obvious property of these images is that they contain many faces. To compress them well, it will be necessary to develop a computational understanding of the appearance of faces.

  • Construct a target database by packaging together digital recordings of songs, concerts, symphonies, opera, and other pieces of music. This kind of inquiry will lead to theories of the structure of music, which must describe harmony, melody, pitch, rhythm and the relationship between these variables in different musical cultures. It must also contain models of the sounds produced by different instruments, as well as the human singing voice.

  • Build a target database by recording from microphones positioned in treetops. A major source of variation in the resulting data will be bird vocalizations. To compress the data well, it will be necessary to differentiate between bird songs and bird calls, to develop tools that can identify species-characteristic vocalizations, and to build maps showing the typical ranges of various species. In other words, this type of inquiry will be a computational version of the traditional study of bird vocalization carried out by ornithologists.

  • Generate a huge database of economic data showing changes in home prices, interest and exchange rate fluctuations, business inventories, welfare and unemployment applications, and so on. To compress this database well, it will be necessary to develop economic theories that are capable of predicting, for example, the effect that changes in interest rates have on home purchases.

Since the above examples involve empirical inquiry into various aspects of reality, any reader who believes in the intrinsic value of science should regard them as at least potentially interesting. Skeptical readers, on the other hand, may doubt the applicability of the Reusability Hypothesis here, and so view an attempt to compress these databases as an eccentric philosophical quest. The following examples are more detailed, and give explicit analysis of what kinds of theories (or computational tools) will be needed, and how those theories will be more widely useful. An important point, common to all of the investigations, is that a single target database can be used to develop and evaluate a large number of methods.

It should be clear that, if successful, these example inquiries should lead to practical applications. The study of music may help composers to write better music, allow listeners to find new music that suits their taste, and assist music publishing companies in determining the quality of a new piece. The investigation of bird vocalization, if successful, should be useful to environmentalists and bird-watchers who might want to monitor the migration and population fluctuation of various avian species. The study of economic data is more speculative, but if successful should be of obvious interest to policy makers and investors. In the case of the roadside video data described below, the result will be sophisticated visual systems that can be used in robotic cars. Also mentioned below is an inquiry into the structure of English text, which should prove useful for speech recognition as well as for machine translation.

1.4.1 Roadside Video Camera

Consider constructing a target database by setting up a video camera next to a highway, and recording video streams of the passing cars. Since the camera does not move, and there is usually not much activity on the sides of highways, the main source of variation in the resulting video will be the automobiles. Therefore, in order to compress the video stream well, it will be necessary to obtain a good computational understanding of the appearance of automobiles.

A simple first step would be to take advantage of the fact that cars are rigid bodies subject to Newtonian laws of physics. The position and velocity of a car must be continuous functions of time. Given a series of images at timesteps $1, 2, \ldots, t$, it is possible to predict the image at timestep $t+1$ simply by isolating the moving pixels in the series (these correspond to the car), and interpolating those pixels forward into the new image, using basic rules of camera geometry and calculus. Since neither the background nor the moving pixel blob changes much between frames, it should be possible to achieve a good compression rate using this simple trick.
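A minimal sketch of the interpolate-forward trick, under strong simplifying assumptions (zero background, a single rigid blob, purely horizontal motion; the function name and test data are hypothetical):

```python
import numpy as np

def predict_next_frame(prev, curr):
    """Predict frame t+1 from frames t-1 and t by shifting the moving blob
    forward at its estimated per-frame velocity."""
    # Velocity estimate: centroid displacement of the nonzero (car) pixels.
    shift = int(round(np.nonzero(curr)[1].mean() - np.nonzero(prev)[1].mean()))
    # Interpolate the current frame's pixels forward by that displacement.
    return np.roll(curr, shift, axis=1)

# A one-pixel "car" moving right at 2 pixels per frame on a 1x10 road.
f1 = np.zeros((1, 10)); f1[0, 2] = 1.0
f2 = np.zeros((1, 10)); f2[0, 4] = 1.0
predicted = predict_next_frame(f1, f2)
assert predicted[0, 6] == 1.0  # the car is predicted at position 6
```

A real compressor would encode only the (small) residual between the predicted and actual frame, which is where the bits are saved.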

Further improvements can be achieved by detecting and exploiting patterns in the blob of moving pixels. One observation is that the wheels of a moving car have a simple characteristic appearance: a dark outer ring corresponding to the tire, along with the off-white circle of the hubcap at the center. Because of this characteristic pattern, it should be straightforward to build a wheel detector using standard techniques of supervised learning. One could then save bits by representing the wheel pixels using a specialized model, akin to a graphics program, which draws a wheel of a given size and position. Since it takes fewer bits to encode the size and position parameters than to encode the raw pixels of the wheel, this trick should save codelength. Further progress could be achieved by conducting a study of the characteristic appearance of the surfaces of cars. Since most cars are painted in a single color, it should be possible to develop a specialized algorithm to identify the frame of the car. Another graphics program could be used to draw the frame of the car, using a variety of parameters related to its shape. Extra attention would be required to handle the complex reflective appearance of the windshield, but the same general idea would apply. Note that the encoder always has the option of “backing off”; if attempts to apply more aggressive encoding methods fail (e.g., if the car is painted in multiple colors), then the simpler pixel-blob encoding method can be used instead.
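To see why the parametric wheel model should save codelength, compare rough bit counts; every figure below is a hypothetical order-of-magnitude estimate, not a measurement from the text:

```python
# Encoding one wheel directly: a 40x40-pixel patch at 8 bits per grayscale pixel.
raw_bits = 40 * 40 * 8

# Encoding the same wheel parametrically: center (x, y) in a 2048-pixel-wide
# frame, a radius under 64 pixels, and two 8-bit shades (tire and hubcap).
param_bits = 2 * 11 + 6 + 2 * 8

# A cheap per-pixel residual to correct the graphics model's small errors.
residual_bits = 40 * 40 * 2

assert raw_bits == 12800
assert param_bits + residual_bits < raw_bits  # the parametric model saves bits
```

Even with a generous residual budget, the parametric encoding comes out far ahead, which is the sense in which the wheel model "saves codelength".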

Additional progress could be achieved by recognizing that most automobiles can be grouped into a discrete set of categories (e.g., a 2009 Toyota Corolla). Since these categories have standardized dimensions, bits could be saved by encoding the category of a car instead of information related to its shape. Initially, the process of building category-specific modules for the appearance of a car might be difficult and time-consuming. But once one has developed modules for the Hyundai Sonata, Chevrolet Equinox, Honda Civic, and Nissan Altima, it should not require much additional work to construct a module for the Toyota Sienna. Indeed, it may be possible to develop a learning algorithm that, through some sort of clustering process, would automatically extract, from large quantities of roadside video data, appearance modules for the various car categories.

1.4.2 English Text Corpus

Books and other written materials constitute another interesting source of target data for comperical inquiry. Here one simply obtains a large quantity of text, and attempts to compress it. One tool that will be very useful for the compression of English text is an English dictionary. To see this, consider the following sentence:

John went to the liquor store and bought a bottle of ____.

Assume that the word in the blank space has $N$ letters, and the compressor encodes this information separately. A naïve compressor would require $N \log_2 26$ bits to encode the word, since there are $26^N$ ways to form an $N$-letter word. A compressor equipped with a dictionary can do much better. First it looks up all the words of length $N$, and then it encodes the index of the actual word in this list. This costs $\log_2 W_N$ bits, where $W_N$ is the number of words of length $N$ in the dictionary. Since most combinations of letters such as "yttu" and "qwhg" are not real words, $W_N \ll 26^N$, and bits are saved.
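The saving is easy to quantify; the dictionary size below is a hypothetical figure for illustration:

```python
import math

def naive_bits(n):
    """Bits to encode an arbitrary n-letter string over a 26-letter alphabet."""
    return n * math.log2(26)

def dictionary_bits(words_of_length_n):
    """Bits to encode an index into the list of real n-letter words."""
    return math.log2(words_of_length_n)

# Hypothetical count: suppose the dictionary holds about 2,000 four-letter words.
n, w_n = 4, 2000
saved = naive_bits(n) - dictionary_bits(w_n)

assert naive_bits(n) > 18          # 4 * log2(26) ~ 18.8 bits
assert dictionary_bits(w_n) < 11   # log2(2000) ~ 11.0 bits
assert saved > 7                   # the dictionary saves roughly 8 bits here
```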

By making the compressor smarter, it is possible to do even better. A smart compressor should know that the word "of" is usually followed by a noun. So instead of looking up all the $N$-letter words, the compressor could restrict the search to only nouns. This cuts down the number of possibilities even further, saving more bits. An even smarter compressor would know that in the phrase "bottle of X", the word X is usually a liquid. If it had an enhanced dictionary which contained information about various properties of nouns, it could restrict the search to $N$-letter nouns that represent liquids. Even better results could be obtained by noticing that the bottle is purchased at a liquor store, and so probably represents some kind of alcohol. This trick would require that the enhanced dictionary contain annotations indicating that words such as "wine", "beer", and "vodka" are types of alcoholic beverages. It may be possible to do even better by analyzing the surrounding text. The word list may be narrowed even further if the text indicates that John is fond of brandy, or that his wife is using a recipe that calls for vodka. Of course, these more advanced schemes are far beyond the current state of the art in natural language processing, but they indicate the wide array of techniques that can in theory be brought to bear on the problem.

1.4.3 Visual Manhattan Project

Consider constructing a target database by mounting video cameras on the dashboards of a number of New York City taxi cabs, and recording the resulting video streams. Owing to the vivid visual environment of New York City, such a database would exhibit an immense amount of complexity and variation. Several aspects of that complexity could then be analyzed and studied in depth.

One interesting source of variation in the video would come from the pedestrians. To achieve good compression rates for the pixels representing pedestrians, it would be necessary to develop theories describing the appearance of New Yorkers. These theories would need to include details about clothing, ethnicity, facial appearance, hair style, walking style, and the relationship between these variables. A truly sophisticated theory of pedestrians would need to take time and place into account: one is quite likely to observe a suited investment banker in the financial district on a weekday afternoon, but quite unlikely to observe such a person in the Bronx in the middle of the night.

Another source of variation would come from the buildings and storefronts of the city. A first step toward achieving a good compression rate for these pixels would be to construct a three-dimensional model of the city. Such a model could be used not only to determine the location from which an image frame was taken, but also to predict the next frame in the sequence. For example, the model could be used to predict that, if a picture is taken at the corner of 34th Street and Fifth Avenue, the Empire State Building will feature very prominently. Notice that a naïve representation of the 3D model will require a large number of bits to specify, and so even more savings can be achieved by compressing the model itself. This can be done by analyzing the appearance of typical building surfaces such as brick, concrete, and glass. This type of research might find common ground with the field of architecture, and lead to productive interdisciplinary investigations.

A third source of variation would come from the other cars. Analyzing this source of variation would lead to an investigation very similar to the roadside video camera inquiry mentioned above. Indeed, if the roadside video researchers are successful, it should be possible for the taxi cab video researchers to reuse many of their results. In this way, researchers can proceed in a virtuous circle, where each new advance facilitates the next line of study.

1.5 Sampling and Simulation

Sampling is a technique whereby one uses a statistical model to generate a data set that is “typical” of it. For example, imagine one knows that the distribution of heights in a certain population is a Gaussian with a mean of 175 cm and a standard deviation of 10 cm. Then by sampling from a Gaussian distribution with these parameters, one obtains a set of numbers that are similar to what might be observed if some actual measurements were done. Most of the data would cluster in the 165-185 cm range, and it would be extremely rare to observe a sample larger than 205 cm.
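The height example can be checked directly by sampling; the seed is fixed only so the sketch is reproducible:

```python
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

# Sample heights from the Gaussian described in the text: mean 175 cm, sd 10 cm.
heights = [random.gauss(175, 10) for _ in range(10_000)]

# Most samples cluster near the mean...
assert 174 < statistics.mean(heights) < 176
assert sum(165 <= h <= 185 for h in heights) / len(heights) > 0.6  # ~68% within 1 sd

# ...and heights above 205 cm (3 sd above the mean) are extremely rare.
assert sum(h > 205 for h in heights) / len(heights) < 0.01
```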

The idea of sampling suggests a useful technique for determining the quality of a statistical model: one samples from the model, and compares the sample data to the real data. If the sample data looks nothing like the real data, then there is a flaw in the model. In the case of one-dimensional numerical data this trick is not very useful. But if the data is complex and high-dimensional, and humans have a good understanding of its real structure, the technique can be quite powerful. As an example of this, consider the following two batches of pseudo-words:

a abangivesery ad allars ambed amyorsagichou an and anendouathin anth ar as at ate atompasey averean cath ce d dea dr e ed eeaind eld enerd ens er evedof fod fre g gand gho gisponeshe greastoreta har has haspy he heico ho ig iginse ill ilyo in ind io is ite iter itwat ju k le lene lilollind lliche llkee ly mang me mee mpichmm n nd nder ng ngobou nif nl noved o ond onghe oounin oreengst otaserethe oua ptrathe r rd re reed reroved sern sinttlof suikngmm t tato tcho te th the toungsshes ver wit y ythe

a ally anctyough and andsaid anot as aslatay astect be beeany been bott bout but camed chave comuperain deas dook ed eveny fel filear firgut for fromed gat gin give givesed got ha hard he hef her heree hilpte hoce hof ierty imber in it jor like lo lome lost mader mare mise moread od of om ome onertelf our out over owd pass put qu rown says seectusier seeked she shim so soomereand sse such tail the thingse tite to tor tre tro uf ughe umily upeeperlyses upoid was wat we were wers whith wird wirt with wor

These words were created by sampling from two different models of the conditional probability $P(x_i \mid x_{i-1}, \ldots, x_1)$ of a letter given a history of preceding letters. The variable $x_i$ stands for the $i$th letter of the word. To produce a word, one obtains the first letter by sampling from the unconditional distribution $P(x_1)$. Then one samples from $P(x_2 \mid x_1)$ to produce the second letter, and so on. A special word-ending character is added to the alphabet, and when this character is drawn, the word is complete.

The two models were both constructed using a large corpus of English text. The first model is a simplistic bigram model, in which the probability of a letter depends only on the immediately preceding letter. The second model is an enhanced version of the bigram model, which uses a refined statistical characterization of English words that incorporates, for example, the fact that it is very unlikely for a word to have no vowel. Most people will agree that the words from the second set are more similar to real English words (indeed, several of them are real words). This perceptual assessment justifies the conclusion that the second model is in some sense superior to the first model. Happily, it turns out that the second model also achieves a better compression rate than the first model, so the qualitative similarity principle agrees with the quantitative compression principle. While the second model is better than the first, it still contains imperfections. One such imperfection relates to the word “sse”. The double-s pattern is common in English words, but it is never used to begin a word. It should be possible to achieve improved compression rates by correcting this deficiency in the model.

All compressors implicitly contain a statistical model, and it is easy to sample from this model. To do so one simply generates a random bit string and feeds it into the decoder. Unless the decoder is trivially suboptimal, it will map any string of bits to a legitimate outcome in the original data space. This perspective provides a nice interpretation of what compression means. An ideal encoder maps real data to perfectly random bit strings, and the corresponding decoder maps random bit strings to real data.
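The bits-to-data direction can be illustrated with a toy prefix code, a hand-built stand-in for a real decoder (the code table below is an illustrative assumption, not from the book). Because the codewords form a complete prefix code, every random bit string decodes to a legitimate symbol sequence.

```python
import random

# A hand-built complete prefix code: common symbols get short codewords.
CODE = {"e": "0", "t": "10", "a": "110", " ": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(bitstring):
    """Map any bit string to symbols; a complete prefix code never gets stuck."""
    out, buf = [], ""
    for b in bitstring:
        buf += b
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

rng = random.Random(1)
random_bits = "".join(rng.choice("01") for _ in range(40))
print(decode(random_bits))  # every random bit string yields legal output
```

Feeding random bits into a better decoder (one built from a better statistical model) would produce correspondingly more realistic output.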

1.5.1 Veridical Simulation Principle of Science

Modern video games often attempt to illustrate scenes involving complex physical processes, such as explosions, light reflections, or collisions between nonrigid bodies (e.g. football players). In order to make these scenes look realistic, video game developers need to include “physics engines” in their games. A physics engine is a program that simulates various processes using the laws of physics. If the physics used in the simulators did not correspond to real physics, the scenes would look unrealistic: the colliding players would fall too slowly, or the surface of a lake would not produce an appropriate reflection.

This implies that there is a connection between scientific theories and veridical simulation. Can this principle be generalized? Suspend disbelief for a moment and imagine that, perhaps as a result of patronage from an advanced alien race, humans had obtained computers before the development of physics. Then scientists could conduct a search for a good theory of mechanics using the following method. First, they would write down a new candidate theory. Then they would build a simulator based on the theory, and use the simulator to generate various scenes, such as athletes jumping, rocks colliding in mid-air, and water spurting from fountains. The new theory would be accepted and the old champion discarded if the former produced more realistic simulations than the latter.

As a more plausible example, consider using the simulation principle to guide an inquiry into the rules of grammar and linguistics. Here the researchers write down candidate theories of linguistics, and use the corresponding simulator to generate sentences. A new theory is accepted if the sentences it generates are more realistic and natural than those produced by the previous champion theory. This is actually very similar to Chomsky’s formulation of the goal of generative grammar; see Chapter 4 for further discussion.

This notion of science appears to meet many of the requirements of empirical science discussed previously in the chapter. It provides a solution to the Problem of Demarcation: a theory is scientific if it can be used to build a simulation program for a particular phenomenon. It gives scientists a way to make decisive theory comparisons, allowing them to search efficiently through the space of theories. It involves a kind of Circularity Commitment: one develops theories of a certain phenomenon in order to be able to construct convincing simulations of the same phenomenon. Sophie could plausibly have answered the shaman’s critique of physics by demonstrating that a simulator based on Newtonian mechanics produces more realistic image sequences than one based on shamanic revelation.

In comparison to the compression principle, the veridical simulation principle has one obvious disadvantage: theory comparisons depend on qualitative human perception. If the human observers have no special ability to judge the authenticity of a particular simulation, the theory comparisons will become noisy and muddled. The method may work for things like basic physics, text, speech, and natural images, because humans have intimate knowledge of these things. But it probably will not work for phenomena which humans do not encounter in their everyday lives.

The advantage of the simulation principle compared to the compression principle is that it provides an indication of where and in what way a model fails to capture reality. The word sampling example above showed how the model failed to capture the fact that real English words do not start with a double-s. If a model of visual reality were used to generate images, the unrealistic aspects of the resulting images would indicate the shortcomings of the model. For example, if a certain model does not handle shadows correctly, this will become obvious when it produces an image of a tree that casts no shadow. The compression principle does not provide this kind of indication. For this reason, the simulation principle can be thought of as a natural complement to the compression principle, one that researchers can use to find out where to look for further progress.

Another interesting aspect of the veridical simulation principle is that it can be used to define a challenge similar to the Turing Test. In this challenge, researchers attempt to build simulators that can produce samples that are veridical enough to fool humans into thinking they are real. The outcome of the contest is determined by showing a human judge two data objects, one real and one simulated. The designers of the system win if the human is unable to tell which object is real.

To see the difficulty and interest of this challenge, consider using videos obtained in the course of the Visual Manhattan Project inquiry of Section 1.4.3 as the real world component. The statistical model of the video data would then need to produce samples that are indistinguishable from real footage of the streets of New York City. The model would thus need to contain all kinds of information and detail relating to the visual environment of the city, such as the layout and architecture of the buildings, and the fashion sense and walking style of the pedestrians. This is, of course, exactly the kind of information needed to compress the video data. This observation provides further support for the intuitive notion that while the simulation principle and the compression principle are not identical, they are at least strongly aligned.

It will require an enormous level of sophistication to win the simulation game, especially if the judges are long term inhabitants of New York. A true New Yorker would be able to spot very minor deviations from veridicality, related to things like the color of the sidewalk carts used by the pretzel and hot dog vendors, or to subtle changes in the style of clothing worn by denizens of different parts of the city. A true New Yorker might also be able to spot a fake video if it failed to include an appropriate degree of strangeness. New York is no normal place, and a real video stream will reflect that by showing celebrities, business executives, beggars, transvestites, fashion models, inebriated artists, and so on. In spite of this difficulty, the alignment between the compression and simulation principles suggests that there is a simple way to make systematic progress: get more and more video data, and improve the compression rate.

1.6 Comparison to Physics

Physics is the exemplar of empirical science, and many other fields attempt to imitate it. Some researchers have deplored the influence of so-called “physics envy” on fields like computer vision and artificial intelligence [12]. This book argues that there is nothing wrong with imitating physics. Instead, the problem is that previous researchers failed to understand the essential character of physics, and instead copied its superficial appearance. The superficial appearance of physics is its use of sophisticated mathematics; the essential character of physics is its obsession with reality. A physicist uses mathematics for one and only one reason: it is useful in describing empirical reality. Just as physicists do, comperical researchers adopt as their fundamental goal the search for simple and accurate descriptions of reality. They will use mathematics, but only to the extent that it is useful in achieving the goal.

Another key similarity between physics and comperical science involves the justification of research questions. Some skeptics may accept that CRM research is legitimate science, but believe that it will be confined to a narrow set of technical topics. After all, the CRM defines only one problem: large scale lossless data compression. But notice that physics also defines only one basic problem: given a particular physical configuration, predict its future evolution. Because there is a vast number of possible configurations of matter and energy, this single question is enormously productive, justifying research into such diverse topics as black holes, superconductivity, quantum dots, Bose-Einstein condensates, the Casimir effect, and so on. Analogously, the single question of comperical science justifies a wide range of research, due to the enormous diversity of empirical regularities that can be found in databases of natural images, text, speech, music, etc. The fact that a single question provides a parsimonious justification for a wide range of research is actually a key advantage of the philosophy.

Both physics and comperical science require candidate theories to be tested against empirical observation using hard, quantitative evaluation methods. However, there is an important difference in the way the theory-comparisons work. Physical theories are very specific. In physics, any new theory must agree with the current champion in a large number of cases, since the current champion has presumably been validated on many configurations. To adjudicate a theory contest, researchers must find a particular configuration in which the two theories make opposing predictions, and then run the appropriate experiment. In comperical science, the predictions made by the champion theory are neither correct nor incorrect; they are merely good. To unseat the champion theory, it is sufficient for a rival theory to make better predictions on average.

2.1 Machine Learning

Humans have the ability to develop amazing skills relating to a very broad array of activities. However, almost without exception, this competence is not innate, and is achieved only as a result of extended learning. The field of machine learning takes this observation as its starting point. The goal of the field is to develop algorithms that improve their performance over time by adapting their behavior based on the data they observe.

The field of machine learning appears to have achieved significant progress in recent years. Researchers produce a steady stream of new learning systems that can recognize objects, analyze facial expressions, translate documents from one language to another, or understand speech. In spite of this stream of new results, learning systems still have frustrating limitations. Automatic translation systems often produce gibberish, and speech recognition systems often cause more annoyance than satisfaction. One particularly glaring illustration of the limits of machine learning came from a “racist” camera system that was supposed to detect faces, but worked only for white faces, failing to detect black ones [106]. The gap between the enormous ambitions of the field and its present limitations indicates that there is some mountainous conceptual barrier impeding progress. Two views can be articulated regarding the nature of this barrier.

According to the first view, the barrier is primarily technical in nature. Machine learning is on a promising trajectory that will ultimately allow it to achieve its long sought goal. The field is asking the right questions; success will be achieved by improving the answers to those questions. The limited capabilities of current learning systems reflect limitations or inadequacies of modern theory and algorithms. While the modern mathematical theory of learning is advanced, it is not yet advanced enough. In time, new algorithms will be found that are far more powerful than current algorithms such as AdaBoost and the Support Vector Machine [37, 118]. The steady stream of new theoretical results and improved algorithms will eventually yield a sort of grand unified theory of learning, which will in turn guide the development of truly intelligent machines.

In the second view, the barrier is primarily philosophical in nature. In this view, progress in machine learning is tending toward a sort of asymptotic limit. The modern theory of learning provides a comprehensive answer to the problem of learning as it is currently formulated. Algorithms solve the problems for which they are designed nearly as well as is theoretically possible. The demonstration of an algorithm that provides an improved convergence rate or a tighter generalization bound may be interesting from an intellectual perspective, and may provide slightly better performance on the standard problems. But such incremental advances will not produce true intelligence. To achieve intelligence, machine learning systems must make a discontinuous leap to an entirely new level of performance. The current mindset is analogous to that of researchers in the 1700s who attempted to expedite ground transportation by breeding faster horses, when they should actually have been searching for a qualitatively different mode of transportation. The problem, then, is in the philosophical foundations of the field: in the types of questions considered by its practitioners and in their philosophical mindset. If this view is true, then to make further progress in machine learning, it is necessary to formulate the problem of learning in a new way. This chapter presents arguments in favor of the second view.

2.1.1 Standard Formulation of Supervised Learning

There are two primary modes of statistical learning: the supervised mode and the unsupervised mode. The present discussion will focus primarily on the former; the latter is discussed in Appendix B. The supervised version can be understood by considering a typical example of what it can do. Imagine one wanted to build a face detection system capable of determining if a digital photo contains an image of a face. To use a supervised learning method, the researcher must first construct a labeled dataset, which is made up of two parts. The first part is a set of images x_1, x_2, …, x_n. The second part is a set of binary labels y_1, y_2, …, y_n, which indicate whether or not a face is present in each image. Once this database has been built, the researcher invokes the learning algorithm, which attempts to obtain a predictive rule f such that f(x_i) ≈ y_i. Below, this procedure is referred to as the “canonical” form of the supervised learning problem. Many applications can be formulated in this way, as shown in the following list:

  • Document classification: the x data are the documents, and the y data are category labels such as “sports”, “finance”, “political”, etc.

  • Object recognition: the x data are images, and the y data are object categories such as “chair”, “tree”, “car”, etc.

  • Electoral prediction: each x is a package of information relating to current political and economic conditions, and the y is a binary label which is true if the incumbent wins.

  • Marital satisfaction: each x is a package of vital statistics relating to a particular marriage (frequency of sex, frequency of arguments, religious involvement, education levels, etc) and the corresponding y is a binary label which is true if the marriage ends in divorce.

  • Stock market prediction: each x is a set of economic indicators such as interest rates, exchange rates, and stock prices for a given day; the y is the change in value of a particular stock on the next day.
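The canonical form can be made concrete with a small sketch. The toy 2-D dataset and the nearest-neighbor rule below are illustrative stand-ins for real images and real learning algorithms; the point is only the shape of the problem: data x_i, labels y_i, and a learned rule f with f(x_i) ≈ y_i.

```python
import math

# Toy labeled dataset: each x_i is a 2-D feature vector, each y_i a binary
# label (stand-ins for images and face/no-face flags).
X = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.8), (0.8, 1.0), (0.2, 0.3), (1.0, 0.9)]
Y = [0, 0, 1, 1, 0, 1]

def fit_nearest_neighbor(xs, ys):
    """'Learning' here just memorizes the data; the returned rule f
    predicts the label of the closest stored point."""
    def f(x):
        dists = [math.dist(x, xi) for xi in xs]
        return ys[dists.index(min(dists))]
    return f

f = fit_nearest_neighbor(X, Y)
print([f(x) for x in X])  # reproduces the training labels
print(f((0.85, 0.9)))     # prediction for a previously unseen point
```

Real algorithms differ enormously in how they construct f, but they share this input/output contract.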

2.1.2 Simplified Description of Learning Algorithms

For readers with no background in machine learning, the following highly simplified description should convey the basic ideas. One starts with a system S that performs some task, and a method for evaluating the performance the system provides on the task. Let this evaluation function be denoted E, so that E(S) is the performance of the system S. In terms of the canonical task mentioned above, the system is the predictive rule f, and the evaluation function is just the squared difference between the predictions and the real data:

E(f) = Σ_i (f(x_i) − y_i)²
A key property of the system S is that it be mutable. If a system is mutable, then a small perturbation will produce a new system S′ that behaves in nearly the same way as S. This mutability requirement prevents one from defining S to be, for example, the code of a computer program, since a slight random change to a program will usually break it completely. To construct systems that can withstand these minor mutations without suffering catastrophic failures, researchers often construct the system by introducing a set of numerical parameters Θ. If the behavior of S changes smoothly with changes in Θ, then small changes to the system can be made by making small changes to Θ. There are, of course, other ways to construct mutable systems. Given a mutable system and an evaluation function, the following procedure can be used to search for a high-performance system:

  1. Begin by setting S = S₀, where S₀ is some default setup (which can be naïve).

  2. Introduce a small change to S, producing S′.

  3. If S′ performs better than S (for the squared-error evaluation, E(S′) < E(S)), keep the change by setting S = S′. Otherwise, discard the modified version.

  4. Return to step #2.
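The loop above can be sketched as random-mutation hill climbing on a parameter vector Θ. The task (fitting a line), the mutation size, and the step count below are all illustrative assumptions.

```python
import random

# Toy task: recover y = 2x + 1. E(theta) is the squared error of the
# system defined by theta; lower is better.
DATA = [(x, 2 * x + 1) for x in range(10)]

def evaluate(theta):
    a, b = theta
    return sum((a * x + b - y) ** 2 for x, y in DATA)

def hill_climb(rng, steps=5000):
    theta = [0.0, 0.0]                             # step 1: default setup
    for _ in range(steps):
        candidate = list(theta)
        candidate[rng.randrange(2)] += rng.uniform(-0.1, 0.1)  # step 2
        if evaluate(candidate) < evaluate(theta):  # step 3: keep if better
            theta = candidate
    return theta                                   # step 4 is the loop

theta = hill_climb(random.Random(0))
print(theta)  # approaches the true parameters [2.0, 1.0]
```

Backpropagation and reinforcement learning methods replace the blind mutation in step 2 with informed updates, but the skeleton is the same.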

Many machine learning algorithms can be understood as refined versions of the above process. For example, the backpropagation algorithm for the multilayer perceptron uses the chain rule of calculus to find the derivative of the evaluation function with respect to the parameters Θ [98]. Many reinforcement learning algorithms work by making smart changes to a policy that depends on the parameters Θ. Genetic algorithms, which are inspired by the idea of natural selection, also roughly follow the process outlined above.

2.1.3 Generalization View of Learning

Machine learning researchers have developed two conceptual perspectives by which to approach the canonical task. The first and more popular perspective is called the Generalization View. Here the goal is to obtain, on the basis of the limited n-sample data set, a model or predictive rule that works well for new, previously unseen data samples. The Generalization View is attractive for obvious practical purposes: in the case of the face detection task, for example, the model resulting from a successful learning process can be used in a system which requires the ability to detect faces in previously unobserved images (e.g. a surveillance application). The key challenge of the Generalization View is that the real distribution generating the data is unknown. Instead, one has access only to the empirical distribution defined by the observed data samples.

In the early days of machine learning research, many practitioners thought that good empirical performance (performing well on the observed data set) was a sufficient condition for a model to achieve good performance on the real distribution. However, they often found that their models would perform very well on the observed data, but fail completely when applied to new samples. There were a variety of reasons for this failure, but the main cause was the phenomenon of overfitting. Overfitting occurs when a researcher applies a complex model to solve a problem with a small number of data samples. Figure 2.1 illustrates the problem of overfitting. Intuitively, it is easy to see that when there are only five data points, the complex curve model should not be used, since it will probably fail to generalize to any new points. The linear model will probably not describe new points exactly, but it is less likely to be wildly wrong. While intuition favors the line model, it is not immediately obvious how to formalize that intuition: after all, the curve model achieves better empirical performance (it goes through all the points).

Figure 2.1: Illustration of the idea of model complexity and overfitting. In the limited data regime situation depicted on the left, the line model should be preferred to the curve model, because it is simpler. In the large data regime, however, the polynomial model can be justified.

The great conceptual achievement of statistical learning is the development of methods by which to overcome overfitting. These methods have been formulated in many different ways, but all articulations share a common theme: to avoid overfitting, one must penalize complex models. Instead of choosing a model solely on the basis of its empirical performance, one must optimize a tradeoff between the empirical performance and model complexity. In terms of Figure 2.1, the curve model achieves excellent empirical performance, but only because it is highly complex. In contrast the line model achieves a good balance of performance and simplicity. For that reason, the line model should be preferred in the limited-data regime. In order to apply the complexity penalty strategy, the key technical requirement is a method for quantifying the complexity of a model.

Once a suitable expression for a model’s complexity is obtained, some further derivations yield a type of expression called a generalization bound. A generalization bound is a statement of the following form: if the empirical performance of the model is good, and the model is not too complex, then with high probability its real performance will be only slightly worse. The caveat “with high probability” can never be done away with, because there is always some chance that the empirical data is simply a bizarre or unlucky sample of the real distribution. One might conclude with very high confidence that a coin is biased after observing 1000 heads in a row, but one could never be completely sure.
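The coin example is easy to check directly: 1000 heads from a fair coin has probability (1/2)^1000, astronomically small but never exactly zero.

```python
# Probability that a fair coin produces 1000 heads in a row.
p = 0.5 ** 1000
print(p)  # about 9.33e-302: vanishingly small, but strictly positive
```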

While most treatments of model complexity and generalization bounds require sophisticated mathematics, the following simple theorem can illustrate the basic ideas. The theorem can be stated in terms of the notation used for the canonical task of supervised learning mentioned above. Let H be a set of hypotheses or rules h that take a raw data object x as an argument and output a prediction of its label y. In terms of the face detection problem, x would be an image, and y would be a binary flag indicating whether the image contains a face. Assume it is possible to find a hypothesis h ∈ H that agrees with all n of the observed data points:

h(x_i) = y_i, for all i = 1 … n
Now select some ε and δ such that the following inequality holds:

|H| (1 − ε)^n ≤ δ

Then with probability 1 − δ, the error rate of the hypothesis h will be at most ε when measured against the real distribution. Abstractly, the theorem says that if the hypothesis class H is not too large compared to the number of data samples n, and some element achieves good empirical performance, then with high probability its performance on the real (full) distribution will be not too much worse. The following informal proof of the theorem may illuminate the core concept.

To understand the theorem, imagine you are searching through a barrel of apples (the hypotheses), looking for a good one. Most of the apples are “wormy”: they have a high error rate on the real distribution. The goal is to find a ripe, tasty apple; one that has a low error rate on the real distribution. Fortunately, most of the wormy apples can be discarded because they are visibly old and rotten, meaning they make errors on the observed data. The problem is that there might be a “hidden worm” apple that looks tasty - it performs perfectly on the observed data - but is in fact wormy. Define a wormy apple as one that has real error rate larger than ε. Now ask the question: if an apple is wormy, what is the probability it looks tasty? It’s easy to find an upper bound for this probability:

P(looks tasty | wormy) ≤ (1 − ε)^n
This is because, if the apple is wormy, the probability of not making a mistake on one sample is at most 1 − ε, so the probability of not making a single mistake on n samples is at most (1 − ε)^n. Now the question is: what is the probability that there are no hidden worms in the entire hypothesis class? Let W_k be the event that the k-th apple is a hidden worm. Then the probability that there are no hidden worms in the hypothesis class is:

P(no hidden worms) = 1 − P(W_1 ∪ W_2 ∪ … ∪ W_|H|) ≥ 1 − Σ_k P(W_k) ≥ 1 − |H| (1 − ε)^n ≥ 1 − δ

The first step is true because the event “no hidden worms” is the complement of the union of the W_k, the second step is true because of the union bound, the third step is true because there are |H| hypotheses, each with P(W_k) ≤ (1 − ε)^n, and the final step is just a substitution of Inequality 2.1.3. Then the result follows by noting that if no hidden worm exists, any hypothesis that agrees with all the observed data must have real error rate at most ε.
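To get a feel for the numbers, one can solve the inequality |H| (1 − ε)^n ≤ δ for the sample size n. The specific values below are purely illustrative.

```python
import math

def samples_needed(hypothesis_count, epsilon, delta):
    """Smallest n with |H| * (1 - eps)^n <= delta, i.e. the point where a
    'hidden worm' (zero empirical error, real error > eps) is improbable."""
    n = math.log(delta / hypothesis_count) / math.log(1 - epsilon)
    return math.ceil(n)

# A class of a million hypotheses, target error 5%, confidence 99%:
print(samples_needed(10**6, 0.05, 0.01))  # → 360
```

Because the hypothesis count enters only through its logarithm, even a trillion-fold larger class requires only a modest increase in samples.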

A crucial point about the proof is that it makes no guarantee whatever that a good hypothesis (tasty worm-free apple) will actually appear. The proof merely says that, if the model class is small and the other values are reasonably chosen, then it is unlikely for a hidden worm hypothesis to appear. If the probability of a hidden worm is low, and by chance a shiny apple is found, then it is probable that the shiny apple is actually worm-free.

A far more sophisticated development of the ideas of model complexity and generalization is due to the Russian mathematician Vladimir Vapnik [116]. In Vapnik’s formulation the goal is to minimize the real (generalization) risk R(h), which can be the error rate or some other function. Vapnik derived a sophisticated model complexity term called the VC dimension d, and used it to prove several generalization bounds. A typical bound is:

R(h) ≤ R_emp(h) + √( (d (ln(2n/d) + 1) + ln(4/δ)) / n )

where R(h) is the real risk of hypothesis h and R_emp(h) is the empirical risk, calculated from the observed data. The bound, which holds for all hypotheses simultaneously, indicates the conditions under which the real risk will not exceed the empirical risk by too much. As above, the bound holds with probability 1 − δ, and n is the number of data samples. The VC dimension d plays a conceptually similar role to the simple log |H| term in the previous theorem. Vapnik’s complex inequality shows the same basic idea as the simple theorem above: the real performance will be good if the empirical performance is good and the complexity of the hypothesis class is small in comparison with the number of data samples. Proofs of theorems in the VC theory also use a similar strategy: show that if the model class is small, it is unlikely that it includes a “hidden worm” hypothesis which has low empirical risk but high real risk. Also, none of the VC theory bounds guarantee that a good hypothesis (one with low R(h)) will actually be found.

The problem of overfitting is easily understood in the light of these generalization theorems. A naïve approach to learning attempts to minimize the empirical risk without reference to the complexity of the model. The theorems show that a low empirical risk, by itself, does not guarantee low real risk. If the model complexity terms log |H| or d are large compared to the number of samples n, then the bounds will become too loose to be meaningful. In other words, even if the empirical risk is reduced to a very small quantity, the real risk may still be large. The intuition here is that because such a large number of hypotheses was tested, the fact that one of them performs well on the empirical data is meaningless. If the hypothesis class is very large, then some hypotheses can be expected to perform well merely by chance.

The above discussion seems to indicate that complexity penalties actually apply to model classes, not to individual models. There is an important subtlety here. In both of the generalization theorems mentioned above, all elements of the model class were treated equally, and the penalty depended only on the size of the class. However, it is also reasonable to apply different penalties to different elements of a class. Say the class H contains two subclasses H₁ and H₂. Then if |H₁| > |H₂|, hypotheses drawn from H₁ must receive a larger penalty, and therefore require relatively better empirical performance in order to be selected. For example, in terms of Figure 2.1, one could easily construct an aggregate class that includes both lines and polynomials. Then the polynomials would receive a larger penalty, because there are more of them.

While more complex models must receive larger penalties, they are never prohibited outright. In some cases it very well may be worthwhile to use a complex model, if the model is justified by a large amount of data and achieves good empirical performance. This concept is illustrated in Figure 2.1: when there are hundreds of points that all fall on the complex curve, then it is entirely reasonable to prefer it to the line model. The generalization bounds also express this idea, by allowing log |H| or d to be large if n is also large.

2.1.4 Compression View

The second perspective on the learning problem can be called the Compression View. The goal here is to compress a data set to the smallest possible size. This view is founded upon the insight, drawn from information theory, that compressing a data set to the smallest possible size requires the best possible model of it. The difficulty of learning comes from the fact that the bit cost of the model used to encode the data must itself be accounted for. In the statistics and machine learning literature, this idea is known as the Minimum Description Length (MDL) principle [120, 95].

The motivation for the MDL idea can best be seen by contrasting it to the Maximum Likelihood Principle, one of the foundational ideas of statistical inference. Both principles apply to the problem of how to choose the best model q out of a class Q to use to describe a given data set D. For example, the model class could be the set of all Gaussian distributions, so that an element q would be a single Gaussian, defined by a mean and a variance. The Maximum Likelihood Principle suggests choosing q so as to maximize the likelihood of the data given the model:

q* = argmax_{q ∈ Q} P(D | q)
This principle is simple and effective in many cases, but it can lead to overfitting. To see how, imagine a data set made up of 100 numbers x_1, x_2, …, x_100. Let the class Q be the set of Gaussian mixture models. A Gaussian mixture model is just a sum of normal distributions with different means and variances. Now, one simple model for the data could be built by finding the mean and variance of the data and using a single Gaussian with the given parameters. A much more complex model can be built by taking a sum of 100 Gaussians, each with mean equal to some x_i and near-zero variance. Obviously, this “comb” model is worthless: it has simply overfit the data and will fail badly when a new data sample is introduced. But it produces a higher likelihood than the single Gaussian model, and so the Maximum Likelihood principle suggests it should be selected. This indicates that the principle contains a flaw.

The Minimum Description Length principle approaches the problem by imagining the following scenario. A sender wishes to transmit a data set $D$ to a receiver. The two parties have agreed in advance on the model class $\mathcal{M}$. To do the transmission, the sender chooses some model $M \in \mathcal{M}$ and sends enough information to specify $M$ to the receiver. The sender then encodes the data using a code based on $M$. The best choice for $M$ minimizes the net codelength required:

$$\hat{M} = \operatorname*{arg\,min}_{M \in \mathcal{M}} \left[ L(M) + L(D \mid M) \right]$$

Where $L(M)$ is the bit cost of specifying $M$ to the receiver, and $L(D \mid M)$ is the cost of encoding the data given the model. If it were not for the $L(M)$ term, the MDL principle would be exactly the same as the Maximum Likelihood Principle, since maximizing $P(D \mid M)$ is the same as minimizing $L(D \mid M) = -\log_2 P(D \mid M)$. The use of the $L(M)$ term penalizes complex models, which allows users of the MDL principle to avoid overfitting the data. In the example mentioned above, the Gaussian mixture model with 100 components would be strongly penalized, since the sender would need to transmit a mean/variance parameter pair for each component.
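The toy example can be redone under the two-part codelength. As a sketch, assume each real-valued parameter costs a nominal 32 bits and the data values are encoded to three decimal places; both constants are assumptions made for illustration, not part of the MDL principle itself:

```python
import math
import random

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
PRECISION = 1e-3       # data encoded to three decimal places (assumption)
BITS_PER_PARAM = 32    # nominal cost of one real parameter (assumption)

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def data_bits(pdf):
    # Codelength of the data under a density: -log2(p(x) * precision) per point.
    return sum(-math.log2(max(pdf(x) * PRECISION, 1e-300)) for x in data)

# Single Gaussian: 2 parameters plus the encoded data.
mu = sum(data) / len(data)
var = sum((x - mu) ** 2 for x in data) / len(data)
single_total = 2 * BITS_PER_PARAM + data_bits(lambda x: gauss_pdf(x, mu, var))

# Comb: 100 mean/variance pairs (200 parameters) plus the encoded data.
comb = lambda x: sum(gauss_pdf(x, xi, 1e-6) for xi in data) / len(data)
comb_total = 200 * BITS_PER_PARAM + data_bits(comb)

print(round(single_total), round(comb_total))
assert single_total < comb_total  # MDL rejects the comb that ML preferred
```

Under these costs the comb's 200 parameters dwarf its savings on the encoded data, so MDL selects the single Gaussian that maximum likelihood rejected.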

The MDL principle can be applied to the canonical task by imagining the following scenario. A sender has the image database $X$ and the label database $Y$, and wishes to transmit the latter to a receiver. A crucial and somewhat counterintuitive point is that the receiver already has the image database $X$. Because both parties have the image database, if the sender can discover a simple relationship between the images and the labels, he can exploit that relationship to save bits. If a rule can be found that accurately predicts $Y$ given $X$, that is to say if a good model of $P(Y \mid X)$ can be obtained, then the label data can be encoded using a short code. However, in order for the receiver to be able to perform the decoding, the sender must encode and transmit information about how to build the model. More complex models will increase the total number of bits that must be sent. The best solution, therefore, comes from optimizing a tradeoff between empirical performance and model complexity.

2.1.5 Equivalence of Views

The Compression View and the Generalization View adopt very different approaches to the learning problem. Profoundly, however, when the two different goals are formulated quantitatively, the resulting optimization problems are quite similar. In both cases, the essence of the problem is to balance a tradeoff between model complexity and empirical performance. Similarly, both views justify the intuition relating to Figure 2.1 that the linear model should be preferred in the low-data regime, while the polynomial model should be preferred in the high-data regime.

The relationship between the two views can be further understood in the context of the simple hidden worm theorem described above. As stated, this theorem belongs to the Generalization View. However, it is easy to convert it into a statement of the Compression View. A sender wishes to transmit to a receiver a database $Y$ of labels which are related to a set of raw data objects. The receiver already has the raw data $X$. The sender and receiver agree in advance on the hypothesis class $\mathcal{H}$ and an encoding format based on it that works as follows. The first bit is a flag that indicates whether a good hypothesis was found. If so, the sender then sends the index of the hypothesis $h^*$ in $\mathcal{H}$, using $\log_2 |\mathcal{H}|$ bits. The receiver can then look up $h^*$ and apply it to the images to obtain the labels $Y$. Otherwise, the sender encodes the labels normally at a cost of $N$ bits. This scheme achieves compression under two conditions: a good hypothesis $h^*$ is found and $\log_2 |\mathcal{H}|$ is small compared to the number of samples $N$. These are exactly the same conditions required for generalization to hold in the Generalization View approach to the problem.
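The codelength accounting of this flag-plus-index scheme is simple enough to write down directly. The particular numbers below (a class of about a million hypotheses, 500 samples) are illustrative assumptions:

```python
import math

def transmission_cost(num_hypotheses, num_samples, found_good_h):
    """Bits needed to send the binary labels under the flag-plus-index scheme."""
    if found_good_h:
        return 1 + math.ceil(math.log2(num_hypotheses))  # flag + index of h*
    return 1 + num_samples                               # flag + raw labels

N = 500        # number of labeled samples
H = 2 ** 20    # a class of about a million hypotheses

assert transmission_cost(H, N, True) == 21    # 1 + 20 bits: real compression
assert transmission_cost(H, N, False) == 501  # fallback: one bit worse than raw
# Compression is achieved exactly when a good h* exists and log2|H| < N:
assert 1 + math.ceil(math.log2(H)) < N
```

The two success conditions of the scheme, a good hypothesis and a class small relative to the sample count, are visible directly in the arithmetic.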

This equivalence in the case of the hidden worm theorem could be just a coincidence. But in fact there are a variety of theoretical statements in the statistical learning literature that suggest that the equivalence is actually quite deep. For example, Vapnik showed that if a model class can be used to compress the label data, then the following inequality relates the achieved compression rate $K$ to the generalization risk $R$, where $\ell$ is the number of samples and the bound holds with probability at least $1 - \eta$:

$$R < 2 \left( K \ln 2 + \frac{\ln(1/\eta)}{\ell} \right)$$

The second term on the right, $\ln(1/\eta)/\ell$, is for all practical cases small compared to the $K \ln 2$ term, so this inequality shows a very direct relationship between compression and generalization. This expression is strikingly simpler than any of the other VC generalization bounds.
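To see how the two terms compare in practice, the bound can be evaluated for sample values. The form $R < 2(K \ln 2 + \ln(1/\eta)/\ell)$ used below is taken on the recollection of Vapnik's statement and should be treated as an assumption, as should the illustrative values of $K$, $\ell$, and $\eta$:

```python
import math

def risk_bound(K, num_samples, eta):
    # Compression-based risk bound: 2 * (K*ln2 + ln(1/eta)/l).
    first = K * math.log(2)                    # compression-rate term
    second = math.log(1 / eta) / num_samples   # confidence term
    return 2 * (first + second), first, second

# A compression rate of 0.05 on 10,000 labels at 95% confidence:
bound, first, second = risk_bound(K=0.05, num_samples=10_000, eta=0.05)
print(bound)
assert second < 0.01 * first  # the confidence term is negligible in practice
```

With any realistic number of samples, the confidence term is orders of magnitude smaller than the compression term, which is the point made in the text.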

There are many other theorems in the machine learning literature that suggest the equivalence of the Compression View and the Generalization View. For example, a simple result due to Blumer et al. relates the learnability of a hypothesis class to the existence of an Occam algorithm for it. In this paper, the key question of learning is whether a good approximation $h$ can be found of the true hypothesis $f$, when both functions are contained in a hypothesis class $\mathcal{H}$. If a good approximation (low error $\epsilon$) can be found with high probability (low failure probability $\delta$) using a limited number of data samples (small $m$), the class is called learnable. Functions in the hypothesis class can be specified using some finite bit string; the length of this string is the complexity of the function. To define an Occam algorithm, let the unknown true function $f$ have complexity $n$, and let there be $m$ samples. The algorithm then produces a hypothesis $h$ of complexity at most $n^c m^\alpha$ that agrees with all the sample data, where $c \geq 1$ and $\alpha < 1$ are constants. Because the complexity of $h$ grows sublinearly with $m$, a simple encoding scheme such as the one mentioned above, based on $\mathcal{H}$ and the Occam algorithm, is guaranteed to produce compression for large enough $m$. Blumer et al. show that if an Occam algorithm exists, then the class is learnable. More complex results by the same authors are given in their paper on learnability and the Vapnik-Chervonenkis dimension (see section 3.2 in particular).

While the Generalization View and the Compression View may be equivalent, the latter approach has a variety of conceptual advantages. First of all, the No Free Lunch theorem of data compression indicates that no completely general compressor can ever succeed. This shows that all approaches to learning must discover and exploit special empirical structure in the problem of interest. This fact does not seem to be widely appreciated in the machine learning literature: many papers advertise methods without explicitly describing the conditions required for the methods to work. Also, because the model complexity penalty $L(M)$ can be interpreted as a prior $P(M) = 2^{-L(M)}$ over hypotheses, the Compression View clarifies the relationship between learning and Bayesian inference. This relationship is obscure in the Generalization View, leading some researchers to claim that learning differs from Bayesian inference in some kind of deep philosophical way.
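The prior interpretation of the complexity penalty can be checked on a toy example. For any prefix-free code, the Kraft inequality guarantees that the quantities $2^{-L(M)}$ sum to at most one, so they behave as a (sub)probability over models, and minimizing the two-part codelength is the same as maximizing the Bayesian posterior. The model lengths and likelihoods below are made-up numbers for illustration:

```python
import math

# Three hypothetical models: (prefix-code length in bits, data likelihood).
models = {"simple": (3, 0.02), "medium": (5, 0.10), "complex": (12, 0.12)}

# Kraft inequality: prefix-free codelengths satisfy sum 2^-L <= 1,
# so P(M) = 2^-L(M) behaves as a prior over models.
assert sum(2 ** -L for L, _ in models.values()) <= 1

def mdl_score(L, lik):        # total bits: L(M) + L(D|M)
    return L - math.log2(lik)

def bayes_posterior(L, lik):  # unnormalized posterior: P(M) * P(D|M)
    return 2 ** -L * lik

best_mdl = min(models, key=lambda m: mdl_score(*models[m]))
best_bayes = max(models, key=lambda m: bayes_posterior(*models[m]))
assert best_mdl == best_bayes  # MDL selection coincides with MAP selection
```

The agreement is not an accident of the numbers: taking $-\log_2$ of the posterior turns the product $P(M)\,P(D \mid M)$ into the sum $L(M) + L(D \mid M)$.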

Another significant advantage of the Compression View is that it is simply easier to think up compression schemes than it is to prove generalization theorems. For example, the Generalization View version of the hidden worm theorem requires a derivation and some modest level of mathematical sophistication to determine the conditions for success, to wit, that $\log_2 |\mathcal{H}|$ is small compared to the number of samples $N$ and that a good hypothesis is found. In contrast, in the Compression View version of the theorem, the requirements for success become obvious immediately after defining the encoding scheme. The equivalence between the two views suggests that a fruitful procedure for finding new generalization results is to develop new compression schemes, which will then automatically imply associated generalization bounds.

2.1.6 Limits of Model Complexity in Canonical Task

In the Compression View, an important implication regarding model complexity limits in the canonical task is immediately clear. The canonical task is approached by finding a short program that uses the image data set $X$ to compress the label data $Y$. The goal is to minimize the net codelength of the compressor itself plus the encoded version of $Y$. This can be formalized mathematically as follows:

$$\hat{M} = \operatorname*{arg\,min}_{M \in \mathcal{M}} \left[ L(M) + L(Y \mid X, M) \right]$$

Where $\hat{M}$ is the optimal model, $L(M)$ is the codelength required to specify model $M$, and $\mathcal{M}$ is the model class. Now assume that $\mathcal{M}$ contains some trivial model $M_0$, and assume that $L(M_0) \approx 0$. The intuition of $M_0$ is that it corresponds to just sending the data in a flat format, without compressing it at all. Then, in order to justify the choice of $\hat{M}$ over $M_0$, it must be the case that:

$$L(\hat{M}) + L(Y \mid X, \hat{M}) < L(Y \mid X, M_0)$$
The right hand side of this inequality is easy to estimate. Consider a typical supervised learning task where the goal is to predict a binary outcome, and there are $N = 1{,}000$ data samples (many such problems are studied in the machine learning literature, see the review [30]). Then a dumb format for the labeled data simply uses a single bit for each outcome, for a total of $N$ bits. The inequality then immediately implies that:

$$L(\hat{M}) < N$$

This puts an absolute upper bound on the complexity of any model that can ever be used for this problem. In practice, the model complexity must really be quite a bit lower to get good results. Perhaps the model requires 200 bits to specify and the encoded data requires 500 bits, resulting in a savings of 300 bits. 200 bits corresponds to 25 bytes. It should be obvious to anyone who has ever written a computer program that no model of any complexity can be specified using only 25 bytes.
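The arithmetic of this bound is short enough to state as code; the figure of 1,000 samples is assumed so that the numbers match the 300-bit savings example:

```python
N = 1000                       # binary labels in the dataset (assumed size)
trivial_cost = N               # the flat format: one bit per label, no model

model_bits, encoded_bits = 200, 500
savings = trivial_cost - (model_bits + encoded_bits)

assert model_bits + encoded_bits < trivial_cost  # 700 < 1000 bits
assert savings == 300
assert model_bits < N          # the absolute ceiling on model complexity
print(model_bits // 8)         # -> 25 bytes for the entire model
```

Twenty-five bytes is roughly the length of this sentence fragment, which makes the complexity ceiling vivid.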

2.1.7 Intrinsically Complex Phenomena

Consider the following thought experiment. Let $D$ be some database of interest, made up of a set of raw data objects $d_i$. Let $\mathcal{P}$ be the set of programs that can be used to losslessly encode $D$. A program $p$ is an element of $\mathcal{P}$; its length is $|p|$. Furthermore let $L_p(D)$ be the codelength of the encoded version of $D$ produced by $p$. For technical reasons, assume also that the compressor is stateless, so that the $d_i$ can be encoded in any order, and the codelength for each object will be the same regardless of the ordering. This simply means that any knowledge of the structure of $D$ must be included in $p$ at the outset, and not learned as a result of analyzing the $d_i$. Now define:

$$C(s) = \min_{p \in \mathcal{P},\ |p| < s} L_p(D)$$

So the quantity $C(s)$ is the shortest codelength for $D$ that can be achieved using a model of length less than $s$. This quantity cannot actually be found, for basically the same reason that the Kolmogorov complexity of a string cannot be computed. But set this fact aside for a moment and consider what the graph of $C(s)$ would look like if somehow it could be calculated.

Clearly, the shape of the function depends on the characteristics of the data set $D$. If $D$ is made up of completely random noise, then $C(s)$ will be a flat line, because random data cannot be compressed. On the other hand, consider the case of constructing $D$ by running many trials of a simple physics experiment, involving some basic process such as electric currents or ballistic motion. In that case a very short program encoding the relevant physical laws would achieve a good compression rate, so $C(s)$ would drop substantially for small $s$.
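Although $C(s)$ itself is uncomputable, the contrast between the two extremes can be illustrated with an off-the-shelf compressor standing in for the minimizing program. The two datasets below, uniform random bytes versus a measurement table generated by a simple linear law, are assumptions chosen for illustration:

```python
import random
import zlib

random.seed(0)

# Dataset 1: pure noise -- no program of any length can compress it much.
noise = bytes(random.randrange(256) for _ in range(100_000))

# Dataset 2: a noiseless "experiment" governed by the simple law y = 3x + 7.
structured = b"".join(b"%d,%d\n" % (i, 3 * i + 7) for i in range(10_000))

c_noise = len(zlib.compress(noise, 9))
c_struct = len(zlib.compress(structured, 9))

print(c_noise / len(noise), c_struct / len(structured))
assert c_noise > 0.99 * len(noise)       # noise: essentially incompressible
assert c_struct < 0.5 * len(structured)  # lawful data compresses well
```

A general-purpose compressor only exploits superficial regularities; a compressor that actually encoded the law $y = 3x + 7$ plus the range of $x$ could shrink the second dataset to a few dozen bytes, which is the sense in which $C(s)$ drops sharply for lawful data.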

A third type of dataset would fall in between these two extremes. In this case, $C(s)$ would decline only gradually: good compression rates could be achieved, but only for large $s$. Such a dataset would represent an intrinsically complex phenomenon. This kind of phenomenon could be understood and predicted well, but only with a highly complex model (large $s$). Because $C(s)$ cannot actually be computed, it is impossible to prove that any given naturally occurring dataset is complex. But common sense suggests complex data sets exist, and may be common. The following quote from an interview with Vladimir Vapnik provides further support for the proposition:

I believe that something drastic has happened in computer science and machine learning. Until recently, philosophy was based on the very simple idea that the world is simple. In machine learning, for the first time, we have examples where the world is not simple. For example, when we solve the “forest” problem (which is a low-dimensional problem) and use data of size 15,000 we get 85%-87% accuracy. However, when we use 500,000 training examples we achieve 98% of correct answers. This means that a good decision rule is not a simple one, it cannot be described by a very few parameters. This is actually a crucial point in approach to empirical inference.
This point was very well described by Einstein who said “when the solution is simple, God is answering”. That is, if a law is simple we can find it. He also said “when the number of factors coming into play is too large, scientific methods in most cases fail”. In machine learning we are dealing with a large number of factors. So the question is: what is the real world? Is it simple or complex? Machine learning shows that there are examples of complex worlds [117].

The forest dataset mentioned by Vapnik is part of the UCI machine learning repository [2], a set of benchmark problems widely used in the field. It is one of the largest datasets in the repository. Most of the other datasets are much smaller: there are many with less than 1000 samples, and only a few with more than 10000. This scarcity of data is caused by the simple fact that generating labelled datasets tends to require a substantial amount of labor. For example, one of the problems in the repository requires the algorithm to guess a person’s income based on factors such as education level, occupation, age, and marital status. In order to produce a single valid data point for this problem, a person must fill out a lengthy questionnaire.

There are many standard learning problems, in the UCI and elsewhere, for which good performance cannot be achieved. Researchers often conclude that this is due to the intrinsic difficulty of the problem. But the results reported by Vapnik on the Forest problem suggest a new diagnosis: perhaps the problems can be solved well, but only by using a complex model. Many of the problems in the UCI repository include a relatively small number of samples; most have less than 15,000, and some have less than 1,000. In this low-data regime, it is impossible to use a complex model without overfitting, so these problems cannot be solved well.

This new diagnosis is part of a larger analysis of the limitations of supervised learning. Supervised learning appears to be based on the assumption that the world, and the problems it contains, are simple. If the world is simple, then simple models can describe it well. Since simple models can be justified based on small labeled data sets, the empirical content of such data sets is sufficient to understand the world. But the idea of intrinsically complex phenomena casts doubt on the assumption of the simplicity of the world. If such phenomena exist, simple models will be inadequate to describe them well. In that case it is essential to obtain large data sets, on the basis of which complex models can be justified.

This analysis may not be very satisfying, because building large labeled data sets is an expensive and time-consuming chore. But that is only because of the necessity of using human intelligence to provide labels for the data. In contrast to labeled data, unlabeled data is easy to acquire in large quantities. Furthermore, raw data objects (e.g. images) have far larger information content than the labels. This suggests that, in order to construct complex models and thereby understand the complex world, it is necessary to find a way to exploit the vast information content of raw unlabeled data sets. The compression principle provides exactly such a mechanism.

2.1.8 Comperical Reformulation of Canonical Task

The Compression Rate Method suggests a qualitatively different approach to the canonical task of supervised learning. The new approach is to separate the learning process into two phases. In the first phase, the goal is to learn a model of the raw data objects themselves. In the face detection example, the first phase involves a study of face images. The results of the first phase are reused in the second phase to find the relationship between the data objects and the labels.

This line of attack totally changes the nature of the problem by vastly expanding the amount of data being modeled. A typical digital image might have about $10^6$ pixels, each of 8 bits apiece, for a total of about $10^7$ bits. If there are $10^4$ images, then the total information content of the database is about $10^{11}$ bits. This enormous expansion in the amount of data being modeled justifies a correspondingly huge increase in the complexity of the models used. For example, one can easily justify a model requiring $10^8$ bits to encode (larger by far than the models used in traditional statistical learning), by showing that it achieves a savings of more than $10^8$ bits on the large database.

Another vast increase in the quantity of data available comes from the realization that it is no longer necessary to use labeled data. A major bottleneck in supervised learning research is the difficulty of constructing labeled databases, which require significant expenditures of human time and effort. The MNIST handwritten digit recognition database, which is one of the largest in common use, contains 70,000 labeled samples [65]. In contrast, the popular web image sharing site Flickr recently announced that the size of its image database had reached five billion images [123].

The beginning of this chapter suggested that algorithms like AdaBoost and the Support Vector Machine solve the problems for which they were designed nearly as well as is theoretically possible. A skeptic could argue that this claim is absurd, because there are many problems which the Support Vector Machine cannot solve, but the human brain can. A comperical researcher would respond that in fact the human brain is solving a different problem. The human brain learns from a vast data set containing visual, auditory, linguistic, olfactory, and tactile components. This allows the brain to circumvent the model complexity limitations inherent in the supervised learning formulation, and obtain a highly sophisticated model. The brain then reuses components from this complex model to handle every specific learning problem, such as object recognition and face detection. If the brain had to solve a limited data classification problem for which none of its prelearned components were relevant, it would fare just as badly as a typical machine learning algorithm.

It may not be obvious why having a good model of $X$ will assist in learning $P(Y \mid X)$. The key is to think in terms of description languages, or compression formats, rather than probability distributions. A good compression format for real world images will involve elements corresponding to abstractions such as “face”, “person”, “tree”, “car”, and so on. Transforming the image from the raw, pixel-based representation into the abstraction-based representation should considerably simplify the problem of finding the relationship between image and label. The face detection problem becomes trivial if the abstract representation contains a “face” element.

Similar articulations of the idea that more complex models can be justified when modeling raw data have appeared in the machine learning literature. For example, Hinton et al. note that:

Generative models can learn low-level features without requiring feedback from the label, and they can learn many more parameters than discriminative models without overfitting. In discriminative learning, each training case constrains the parameters only by as many bits of information as are required to specify the label. For a generative model, each training case constrains the parameters by the number of bits required to specify the input [48].

In the same paper, the authors provide clear evidence that an indirect approach (learn $P(X)$, then $P(Y \mid X)$) can produce better results than a purely direct approach (learn $P(Y \mid X)$ directly). Note that a generative model is exactly one half of a compression program: the decoder component. Hinton develops the generative model philosophy at greater length in [46]. Hinton’s philosophy is one of the key influences on this book; see the Related Work section for further discussion.

The comperical approach also bears some similarity to the family of methods called unsupervised learning. The goal of unsupervised learning is to discover useful structure in collections of raw data objects with no labels. Unlike supervised learning, which can be fairly clearly defined, unsupervised learning is an extremely broad area, making comparisons difficult. Roughly, there are three qualities that make the comperical approach different. First, there is no standard quantity that all unsupervised learning researchers attempt to optimize; this makes it difficult to compare competing solutions. Comperical researchers focus exclusively on the compression rate, allowing strong comparisons. Second, unsupervised learning methods are advertised as general purpose and widely applicable, whereas comperical methods are targeted at specific datasets and emphasize the empirical character of the research. Third, comperical research calls for the construction of large specialized data sets, such as the Visual Manhattan Project. Further discussion is contained in the Related Work appendix.

2.2 Manual Overfitting

The above discussion involves the problem of overfitting, and the ways in which machine learning methods can be used to prevent overfitting. There is, however, another conceptual issue that machine learning researchers must face, which is in some sense even more subtle than overfitting. This can be called “manual overfitting”, and is illustrated by the following thought experiment.

2.2.1 The Stock Trading Robot

Sophie’s brother Thomas also studied to be a physicist, but after receiving his doctorate he decides that he would rather be rich than tenured, and gets a job as a quantitative analyst at an investment bank. He joins a team that is working on building an automated trading system. Thomas receives a bit of training in financial engineering and machine learning. When his training is complete, he is given a database of various economic information including interest rates, unemployment figures, stock price movements, and so on. This database is indexed by date, so it shows, for example, the change in interest rates on January 1st, the change on January 2nd, and so on. Thomas also receives a smaller, second database showing the changes in the yen-dollar exchange rates, also indexed by day. Thomas’ task is to write a program that will use the economic indicator information to predict the changes in the yen-dollar exchange rate.

Thomas, as a result of growing up in his sister’s intellectual shadow, and in spite of having a Ph.D. in physics from a prestigious university, has never been confident in his own abilities. He has read about the generalization theorems of machine learning, but is still puzzled by some of the more esoteric mathematics. So, he decides to attempt to solve the prediction problem using the simplest possible generalization theorem, the Hidden Worm theorem mentioned above.

Thomas finds it easy to formulate the economic prediction problem in terms of the canonical task of supervised learning. The data $x_i$ is the batch of economic indicators for a given day. The corresponding label $y_i$ is a binary value that is true or false according to whether the yen-dollar exchange rate goes up on the following day. Thomas builds a hypothesis class $\mathcal{H}$ by writing down a family of simple rules, and forming hypotheses by combining the rules together in various ways. Each rule is just a binary function, which returns true if a given economic indicator exceeds a certain threshold. So, for example, a rule might be true if the change in Treasury bill yields was greater than 5% on a certain day. The specific value of the threshold is a parameter; at first, Thomas uses $T$ different choices for possible threshold values. There are also $E$ economic indicators, so the number of rules is $R = TE$. Somewhat arbitrarily, Thomas decides to form hypotheses by combining three basic rules using the logical pattern $r_i \wedge (r_j \vee r_k)$. This implies that the total size of the hypothesis class is $|\mathcal{H}| = R^3$, and the log size is given by:

$$\log |\mathcal{H}| = 3 \log R = 3 (\log T + \log E)$$

Lacking any strong guidance regarding how to choose these values, Thomas decides that the rule should be in error at most 1% of the time, and he wants to be at least 95% confident that the result holds. In the terms of the Valiant theorem, these imply values of $\epsilon = 0.01$ and $\delta = 0.05$. The two databases also contain $m = 500$ days worth of data. Rearranging Inequality 2.1, he finds that an upper bound on the size of the hypothesis class is:

$$\log |\mathcal{H}| \leq m \epsilon - \log \frac{1}{\delta}$$

This indicates that there is a problem: his original choice for the model class is too large, and so using it may cause overfitting. To deal with the problem, Thomas decides to reduce the number of thresholds from $T$ to a much smaller number. This new class has a log size below the bound, so generalization should hold if a good hypothesis is found.

Thomas writes a program that will find a good hypothesis $h^*$ if it exists, by testing each hypothesis against the data. He notices that a naïve implementation will require quite a bit of time to check every hypothesis, since $|\mathcal{H}|$ is large in absolute terms. However, he is able to find a variety of computational tricks to expedite the computation. Finally, he completes the learning algorithm, invokes it on the data set, and leaves it to run for the night. After a night of troubled dreams, Thomas is delighted to discover, when he arrives at work the next day, that the program was successful in finding a good hypothesis $h^*$. This hypothesis correctly predicts the movement of the yen-dollar exchange rate for all of the 500 days recorded in the database. Excited, he begins working on a presentation describing his result.

When Thomas presents his method and result at the next group meeting, his boss listens to the talk with a blank expression. On the final slide, Thomas states his conclusion that, with 95% probability, the hypothesis he discovered should be correct 99% of the time. The boss does not really understand what Thomas is saying, but his basic management policy is to always push his employees to work harder. So he declares that the 95% number is not good enough: “We need a better guarantee! Do it again!” Then he storms out of the room.

Thomas, at this point, is somewhat vexed. He is not sure how to respond to the boss’ demand to improve the generalization guarantee. A coworker mentions that she had been working with the database before it was given to Thomas, and she was surprised to find that it contained any predictive power at all: she thought the economic indicator information was too noisy, and unrelated to the yen-dollar exchange rate. Thomas tries the same approach a couple more times, using different kinds of hypothesis classes. Most of them fail to find a good hypothesis, but eventually he finds another class that is small enough to produce a tighter generalization probability (smaller $\delta$), and that also contains a good $h^*$. Relieved, he sends an email to his boss to inform him of the discovery. The boss congratulates him and promises him a big bonus if the hypothesis works out in the actual trading system.

Later on, as Thomas is reviewing his notes on the project, he finds something vaguely disturbing: the first $h^*$, from the original presentation, is suspiciously similar to the second $h^*$, which came from a smaller class and so produced a smaller $\delta$. Indeed, on further examination, he notices that the two are almost exactly the same, except for slightly different threshold values. What can this mean? He goes over the math again, and thinks to himself: what if I had just cut the size of the original $\mathcal{H}$ by 90%, while making sure that it still contained the good hypothesis? If $|\mathcal{H}|$ goes down by 90%, that means $\delta$ can also be reduced by 90% while leaving the rest of the numbers unchanged. Then the second analysis will produce a much better generalization probability than the first one, even though they both use the same hypothesis!

2.2.2 Analysis of Manual Overfitting

The above thought experiment illustrates the problem of manual overfitting. Thomas attempted to use a model complexity penalty and the associated generalization theorem to guard against overfitting. But the theorem has a crucial implicit requirement: the model class must not be changed after looking at the data. By testing many different model classes and cherry-picking one that works, Thomas violated this requirement. What he has really done, in effect, is implement a learning algorithm by hand. Instead of using only the algorithm to choose the best hypothesis, Thomas performed a lot of the hypothesis-selection work himself. And this manual selection process implicitly uses a much larger hypothesis class, thus causing overfitting.

To see this more clearly, imagine Thomas decides to use 1000 model classes $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_{1000}$. He writes a program that checks each model class to determine if it contains a good hypothesis $h^*$. The program finds a good $h^*$ in one of the classes, say $\mathcal{H}_k$. Thomas then concludes that generalization will hold, since $\log |\mathcal{H}_k|$ is small compared to the number of samples. But this is nonsense, because he has effectively tested the much larger class $\mathcal{H}_1 \cup \mathcal{H}_2 \cup \cdots \cup \mathcal{H}_{1000}$.
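This effect is easy to demonstrate by simulation. The sketch below searches up to 1000 classes of 100 random hypotheses each for a rule that predicts pure coin flips; the class sizes and the 90% accuracy threshold are arbitrary assumptions:

```python
import random

random.seed(0)
N = 20
labels = [random.randrange(2) for _ in range(N)]  # pure coin flips: no signal

def accuracy(h):
    return sum(int(p == y) for p, y in zip(h, labels)) / N

# Up to 1000 "model classes", each containing 100 random hypotheses.
found = None
classes_tried = 0
for _ in range(1000):
    classes_tried += 1
    cls = [[random.randrange(2) for _ in range(N)] for _ in range(100)]
    best = max(cls, key=accuracy)
    if accuracy(best) >= 0.9:
        found = best
        break

# The apparent 90%-accurate "rule" for coin flips is an artifact of having
# searched 100 * classes_tried hypotheses, not the 100 in the winning class.
assert found is not None
assert accuracy(found) >= 0.9
```

A hypothesis fitting 20 coin flips at 90% accuracy is found with near certainty, even though no hypothesis has any predictive power; the class size that matters for the generalization guarantee is the union of everything searched.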

The problem of manual overfitting causes deep conceptual problems for the traditional mindset of engineering and computer programming. The typical approach taken by a programmer when faced with a new task is to proceed toward a solution through a series of iterative refinements. She writes an initial prototype without too much painstaking care, runs it, and observes the errors that are produced. Then she diagnoses the errors, makes an appropriate modification to the code, and tries again. She repeats this process until there are no more errors. This process of repeated code/test/debug loops works fine for normal programming tasks. Abstractly, this kind of trial and error process is utilized by almost all successful human endeavors. But this approach just does not work for machine learning - at least, it does not work unless the dataset is replenished with new samples every time the hypothesis class is changed. This is almost never done, because of the difficulty of creating labeled datasets.

The fallacy of manual overfitting becomes self-evident when the learning problem is properly phrased in terms of the Compression View. To see this, imagine that Thomas has a colleague who is departing tomorrow for Madagascar. The colleague has the large database of economic indicator information, and will take it with him. But he also needs the yen-dollar database, which has not yet been prepared. Before the departure, Thomas and his colleague agree on a compression format based on a class $\mathcal{H}$, which includes many rules for predicting the yen-dollar information from the rest of the economic indicators. Then the colleague leaves, and a while later, the yen-dollar database becomes available. Thomas finds a good rule $h^* \in \mathcal{H}$, and so the format achieves a good compression rate. However, at this point, Thomas’ boss intervenes, complaining that the compression rate is not good enough (it is expensive to send data to Madagascar). Having looked at the data, Thomas can easily build a new compressor, based on some new model class $\mathcal{H}'$, that achieves much better compression rates. But to use this new compressor, he would have to send it to his colleague in Madagascar, thereby defeating the whole point of the exercise.

2.2.3 Train and Test Data

Some people might argue that the problem of manual overfitting is not important, because in practice people do not report results obtained from generalization theorems. Instead, researchers report actual results on so-called “test” data sets. The idea is that any benchmark database is partitioned into two subsets: the “training” data and the “test” data. Researchers are licensed to use the training data for whatever purposes they want. The usual practice is to use the training data to experiment with different approaches, and to find good parameters for their statistical models. The test data, however, is supposed to be used only for reporting the actual results achieved by the algorithm. The idea is that if the learning algorithm does not have access to the test data, then the performance on the test data is a good approximation of its actual generalization ability. The training/test strategy is fraught with methodological difficulty, however, as is illustrated by the following ominous warning on the web page that hosts the Berkeley Segmentation Dataset, a well known benchmark in computer vision:

Please Note: Although this should go without saying, we will say it anyway. To ensure the integrity of results on the test data set, you may use the images and human segmentations in the training set for tuning your algorithms but your algorithms should not have access to any of the data (images or segmentations) in the test set until your are finished designing and tuning your algorithm [75].

The concern underlying this warning is that a problem analogous to manual overfitting can occur even when using the train/test strategy. Here, manual overfitting occurs when a researcher tries many different model classes, invokes a learning algorithm to select an optimal hypothesis from each class based on the training data, measures the performance of each hypothesis on the test data, and then selects the model class that provides the best performance. Again, this procedure gives the illusion that it prevents overfitting, because the (computer) algorithm does not observe the test data when selecting the optimal hypothesis. But what has actually happened is that the developer has implemented a learning process by hand, and that learning process “cheats” by looking at the test data.

There are several reasons why the approach advocated by the warning probably does not actually work. Most basically, the prohibition against iterative tuning and design is deeply at odds with the natural tendencies of computer programmers. If the purpose of a system is to produce good segmentations, and the quality of the produced segmentations is defined by the score on the test data, then if a system does not achieve a good test set score, it must have a bug and so require further development and refinement. Moreover, researchers are often under intense pressure to produce results. A PhD student whose thesis involves a new segmentation method may not be allowed to graduate if his algorithm does not produce state-of-the-art results. Also, there is no way to prevent publication bias (or positive results bias). A researcher may try dozens of different algorithms; if only one of them succeeds in achieving good performance on the test data, then that is the one that gets turned into a paper and submitted. Finally, the underlying point about manual overfitting is that it does not matter whether an algorithm or a human does the hypothesis selection. If the hypothesis is selected from too large a class, overfitting will result. The idea of benchmarking is that it provides the community with a way of finding the “best” approach. But this just means that now it is the research community that does the selection, instead of an algorithm or a human.
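
This selection effect is easy to demonstrate numerically. The following sketch is purely illustrative (the data, seeds, and candidate "model classes" are all invented): each candidate is just a fixed random predictor, yet choosing among many candidates by their test-set score manufactures an apparently good result out of pure noise, exactly as the community-level selection described above would.

```python
import random

# Invented setup: 200 binary labels with no learnable structure,
# split into "training" and "test" halves of 100 each.
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(200)]
train, test = labels[:100], labels[100:]

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def model_predictions(seed):
    # Stand-in for "trying another model class by hand": each seed
    # yields a different fixed random predictor.
    r = random.Random(seed)
    return [r.randint(0, 1) for _ in range(100)]

# Honest protocol: choose among 50 candidate classes using training data.
honest = max(range(50), key=lambda s: accuracy(model_predictions(s), train))

# Manual overfitting: choose among the same candidates by TEST score.
cheating = max(range(50), key=lambda s: accuracy(model_predictions(s), test))

print(accuracy(model_predictions(honest), test))    # hovers near chance
print(accuracy(model_predictions(cheating), test))  # inflated by selection
```

Since the labels are random, any gap between the two reported scores is an artifact of the selection procedure, not of genuine learning.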

2.2.4 Comperical Solution to Manual Overfitting

Overfitting is a subtle and insidious problem, and it took statisticians many years to fully understand it and to propose safeguards against it. Arguably, this process is still going on today. Manual overfitting is a similarly subtle problem, which the field has not yet begun to truly recognize. Perhaps future researchers will propose elegant solutions to the problem. There is, however, a direct method to prevent manual overfitting that is available today.

The solution is: pick a model class now, and use it for all future learning problems. This solution is, of course, quite drastic. If the model class selected today is of only moderate complexity, then some learning problems will be rendered insolvable simply because of prior limitations on the expressivity of the class. If, on the other hand, the model class is highly complex, it potentially includes a hypothesis that will work well for all problems of possible interest. But such a large and complex class would be extremely prone to overfitting. In order to avoid overfitting when using such a class, it would be necessary to use a huge amount of data.

The comperical answer to manual overfitting is exactly this strategy. Comperical research employs the same model class, based on a programming language, to deal with all problems. Since this model class is extremely (in fact, maximally) expressive, it is necessary to use vast quantities of data to prevent overfitting. As described above, this requirement is accommodated by modeling the raw data objects, such as images or sentences, instead of the labels. Because the raw data objects have much greater information content, overfitting can be prevented.
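
The accounting behind this claim can be made concrete with a toy two-part code. In the sketch below, which is a hedged illustration (the corpus, the "REPEAT" scheme, and all sizes are invented; zlib merely stands in for a generic shared coder), a theory is scored by the length of its model plus the length of the data encoded under that model, so a complex model is justified exactly when it pays for itself in encoded size.

```python
import zlib

# Invented stand-in for a large raw database.
data = b"the cat sat on the mat. " * 4000  # 96,000 bytes

# Theory 1: no model at all -- transmit the raw bytes.
cost_raw = len(data)

# Theory 2: a hand-written model (the repeated unit plus a count); the
# model's own length is charged against the score.
model = b"REPEAT:" + b"the cat sat on the mat. "
payload = b"4000"
cost_two_part = len(model) + len(payload)

# Theory 3: a generic compressor whose decoder both parties already share.
cost_generic = len(zlib.compress(data, 9))

# The better theory is the one with the smaller total description length.
assert cost_two_part < cost_generic < cost_raw
print(cost_raw, cost_generic, cost_two_part)
```

On a 100 Gb database the same accounting justifies a 10 Mb model whenever that model shaves more than 10 Mb off the encoded size.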

Notice how the comperical approach rehabilitates the natural code/test/debug loop used by programmers. Comperical researchers can propose as many candidate models as time and imagination allow, with no negative consequences. In the comperical approach to learning, there is only one basic problem: find a good hypothesis (compressor) to explain the empirical data. This search is extremely difficult, but it is perfectly legitimate to perform the search using a combination of human insight and brute computational force, since the model class never changes and complex models are appropriately penalized. In contrast, the traditional approach to learning involves two simultaneous goals. The researcher must find a small class that is expressive enough to solve the problem of interest, and develop an algorithm that can find a good hypothesis based on the data. This suggests the legitimacy of experimenting with different model classes and associated algorithms. But the idea of manual overfitting implies that this procedure contains a logical error.

2.3 Indirect Learning

2.3.1 Dilbert’s Challenge

Consider the following scene that takes place in the dystopian corporate world portrayed in the Dilbert cartoon. Dilbert is sitting at his cubicle, working sullenly on some irrelevant task, when the pointy-haired boss comes up to him and delivers the following harangue:

I’ve been told by the new strategy consultants that exactly one year from today, our company will need to deploy a highly robust and scalable web services application. Deploying this application on time is absolutely critical to the survival of the company. Unfortunately, the consultants won’t know what the application is supposed to do for another 11 months. Drop everything you’re doing now and get started right away.

At first, this seems like an unreasonable demand. Surely it is impossible for anyone - let alone the beleaguered Dilbert - to develop any kind of significant software system in a single month. In a month, a team of highly skilled and motivated developers might be able to develop a prototype. And software designs are a bit like pancakes: one should always throw away the first batch. Even if, by some miracle, Dilbert got the design right on the first attempt, he would still have to do several iterations of polishing, debugging, and quality assurance before the software was ready for a public release.

Still, after thinking about the problem for a while, Dilbert begins to believe that it is actually an interesting challenge. He begins to see the problem in a new light: the specifications are not completely unknown, merely incomplete. For example, the boss mentioned that the application will need to be “scalable” and will involve “web services”. Dilbert has read stories about how highly skilled programmers, by working nonstop and using powerful tools, were able to build sophisticated software systems very rapidly. Dilbert decides to use the 11 month preparation period to familiarize himself with various technologies such as web application frameworks, debugging tools, databases, and cloud computing services. He also resolves to do a series of high-intensity training simulations, where he will attempt to design, implement, and test a software system in the space of a single month. Dilbert has no illusions that this preparation will allow him to perform as well as he could with a full year of development time. But he is confident that the preparation will make his ultimate performance much better than it would otherwise be.

Situations that are conceptually similar to Dilbert’s Challenge arise very often in the real world. The military, for example, has no way to predict when the next conflict will take place, who the opponents will be, or under what conditions it will be fought. To deal with this uncertainty, the military carries out very general preparations. Sometimes these preparations are ineffective. During the occupation of Iraq, the United States Army deployed a large number of heavily armored Abrams tanks, which contributed very little to the urban counterinsurgency efforts that constituted the core of the military mission. Overall, though, the military’s preparation activities are generally successful.

Another example of a Dilbert Challenge appears in the context of planning for the long-term survival of humanity. Many people who think about this topic are concerned about catastrophic events such as asteroid impacts that could destroy human civilization or completely wipe out humanity. While each individual category of event is unlikely, the overall danger could be significant, due to the large number of different catastrophes: plague, nuclear war, resource depletion, environmental disaster, and so on. By the time the dire event is near enough to be predicted with certainty, it might be too late to do anything about it. The futurists must therefore attempt to prepare for an as-yet-unspecified danger.

A full consideration of Dilbert Challenges that appear in practice, and methods for solving them, is the subject for another book. The relevance of the idea to the present discussion is that the precise nature of real-world cognitive tasks is often unknown in advance. For this reason, learning should be thought of as a general preparation strategy, rather than a direct attempt to solve a particular problem. This idea is illustrated in the following thought experiment.

2.3.2 The Japanese Quiz

Figure 2.2: Japanese language quiz.

Sophie is in the city of Vancouver for a conference, and has decided to do a bit of sightseeing. She is wandering around the city, pondering for the thousandth time the implications of the double-slit experiment, when a group of well-dressed elderly Asian gentlemen approach and request her attention. She is vaguely worried that they are members of some strange religious sect, but they claim to be members of the Society for the Promotion of Japanese Language and Literature, and they are organizing a special contest, which has a prize of ten thousand dollars. They show Sophie a sheet of paper, on which 26 Japanese sentences are written (Figure 2.2). The first 20 sentences are grouped into two categories marked A and B, while the remaining 6 are unmarked. The challenge, they inform Sophie, is to infer the rule that was used to categorize the first twenty sentences, and to use that rule to guess the category of the second set of sentences.

Sophie studies the sentences intently for several minutes. Her spirits sink rapidly as she realizes the quiz is nearly impossible. The complex writing is completely incomprehensible to her. She makes an educated guess by attempting to match unlabeled sentences to similar labeled ones, but she has very little confidence in her guess. She marks her answers on the paper and submits it to one of the gentlemen. He looks it over for a moment, and then regretfully informs Sophie that she did not succeed. The sentences, he explains, are newspaper headlines. The sentences of category A relate to political affairs, while category B is about sports. The correct response, therefore, is AABBAB.

The Japanese apologize for the difficulty of the test, but then inform Sophie that they will be giving the quiz again next year. She looks at them for a moment, puzzled. “Do you really mean,” she asks, “that I can come back here next year and take the same test, and if I pass I will win ten thousand dollars?” Somewhat abashed, the gentlemen respond that it will of course not be exactly the same test. There will be a new set of sentences and a new rule used to categorize them. But the idea will be basically the same; it will still be a Japanese test. “So,” Sophie asks, suspiciously, “what is to prevent me from spending the whole year studying Japanese, so that I will be ready, and then coming back next year to collect the ten thousand dollars?” The gentlemen reply that there is nothing to prevent her from doing this, and in fact giving people an incentive to study Japanese is exactly the point of the Society!

The Japanese Quiz thought experiment raises several interesting issues. First, notice that the challenge facing Sophie is conceptually analogous to the challenge facing Dilbert mentioned above. In both cases, the protagonist must find a way to exploit a long preparation period in order to perform well on a test that is only incompletely specified in advance. The difference is that in Sophie’s case, it is immediately obvious that a preparation strategy exists and, if pursued diligently, will produce much improved results on the test.

The second issue relates to the decisive role played by background knowledge in the quiz. Any skilled reader of the Japanese language will be able to solve the quiz easily. At the same time, a person without knowledge of Japanese will be unable to do much better than random guessing. An untrained person sees the sentences as merely a collection of odd scribbles, but the proficient Japanese reader sees meaning, structure, and order. It is as if, in the mind of the Japanese reader, the sentences are transformed into a new, abstract representation, in which the difference between the two groups of sentences is immediately obvious. In other words, two people can observe exactly the same pool of data, and yet come to two utterly different conclusions, one of which is far superior to the other.

The Japanese Quiz thought experiment poses something of a challenge for the field of machine learning. It is very easy to phrase the basic quiz in terms of the standard formulation: the sentences correspond to the raw data objects, and the categories are the labels. Using this formulation, any one of the many algorithms for supervised learning can be applied to the problem. However, while the algorithms can be applied, they will almost certainly fail, because the number of samples is too small. Solving the quiz requires a sophisticated understanding of Japanese, and such understanding cannot be obtained from only twenty data samples. But the fact that the learning algorithms will fail is not the problematic issue, since the task cannot be solved by an intelligent but unprepared human, either. The problematic issue relates to the preparation strategy for next year’s test. For a human learner, it is obvious that an intense study of Japanese will produce a decisive advantage in the next round. This suggests that the same strategy should work for a machine learning system. The problem is that machine learning theory provides no standard way to operationalize the strategy “study Japanese”. Since the preparation strategy will provide a decisive performance advantage, the fact that the philosophy of machine learning provides no standard way to formalize it indicates that the philosophy has a gaping hole.
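
The sample-size problem can be seen in a small simulation. In this hedged sketch (the "sentences" are invented 500-bit random vectors and the labels are arbitrary, chosen only for illustration), a memorizing classifier scores perfectly on the twenty training examples while generalizing no better than chance:

```python
import random

# Invented stand-in for the quiz: each "sentence" is a 500-dimensional
# random bit vector, and the labels carry structure the learner cannot
# possibly recover from twenty examples.
rng = random.Random(1)

def sample():
    return [rng.randint(0, 1) for _ in range(500)]

train = [(sample(), rng.randint(0, 1)) for _ in range(20)]
test = [(sample(), rng.randint(0, 1)) for _ in range(200)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def predict_1nn(x):
    # Nearest-neighbour prediction: memorizes the training set.
    return min(train, key=lambda pair: hamming(pair[0], x))[1]

train_acc = sum(predict_1nn(x) == y for x, y in train) / len(train)
test_acc = sum(predict_1nn(x) == y for x, y in test) / len(test)

print(train_acc)  # 1.0: perfect recall of the twenty samples
print(test_acc)   # hovers near chance: nothing was truly learned
```

The gap between the two scores is the signature of a hypothesis space far too rich for the available data, which is precisely Sophie's situation before she studies Japanese.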

In contrast, the comperical philosophy provides a clear way to formalize the preparation process in computational terms. To prepare for the quiz means to obtain a large corpus of raw Japanese text and perform a CRM-style investigation using it as a target database. A system that achieves a strong compression rate on the text corpus can be said to “understand” Japanese in a computational sense. The abstractions and other knowledge useful for compressing the text will also be useful for solving the quiz itself; this is just a restatement of the Reusability Hypothesis of Chapter 1.

2.3.3 Sophie’s Self-Confidence

Sophie has always been a bit of a radical, and, upon returning from the conference, decides to neglect her research and teaching duties and dedicate all of her free time to learning Japanese. She is also an autodidact and a loner, so she does not bother to sign up for classes. Instead, she decides to hole herself up in the library with a large quantity of educational material: textbooks, dictionaries, audio guides, and so on. She also brings a lot of Japanese pop culture products: TV shows, movies, romance novels, comic books, and radio programs.

At the beginning of her studies, Sophie is very aware of her own lack of understanding. The complicated kanji characters are indistinguishable and meaningless to her. When listening to the spoken language, the phonemes blend together into an unintelligible stream from which she cannot pick out individual words. Once she begins to study, of course, the situation changes. She learns to read the simplified hiragana and katakana syllabaries, as well as a couple of the basic kanji. She memorizes some of the most common words, and learns the basic structure of a Japanese sentence, so that she can usually pick out the verb and its conjugation pattern.

After a couple of months, Sophie has absorbed most of the material in the textbooks. At this point she begins to rely almost exclusively on cultural material. She watches television programs, and is often able to glean new words and their meaning by matching the spoken Japanese to the English subtitles. With the help of an online dictionary, she translates articles from Japanese newspapers into English. She creates groups of similar kanji, and invents odd stories for each group, so that she can remember the meaning of a character by recalling the story. She cultivates a taste for Japanese pop music, which she listens to on her MP3 player at the gym.

Finally, after a year of intense study, Sophie’s comprehension of the language is nearly complete. She knows the meaning and pronunciation of the two thousand most important kanji, and she is familiar with all of the standard grammatical rules. She can read books without consulting a dictionary; if she comes across a new word, she can usually infer its meaning from the context and the meaning of the individual kanji from which it is constituted. She can watch television shows without subtitles and understand all of the dialogue; she is even able to understand Japanese comedy. At the conclusion of her year’s worth of study, she is completely confident in her own command of the language.

Sophie’s self-confidence in her new language ability raises an important philosophical question: how does she know she understands Japanese? Note that she is not taking classes, consulting a tutor, or taking online proficiency tests. These activities would constitute a source of external guidance. It would be easy for Sophie to justify her self-confidence if, for example, she took a weekly online quiz, and noticed that her scores progressively increased. But she is not doing this, and is nevertheless very confident in her own improvement. Sophie may be exceptionally bright and driven, but her ability to gauge her own progress is not unique. Anyone who has ever learned a foreign language has experienced the sensation of gradually increasing competence.

The premise of machine learning is that the same basic principles of learning apply to both humans and artificial systems. In machine learning terms, Sophie’s brain is a learning system and her corpus of Japanese language material is an unlabeled dataset. The thought experiment indicates, therefore, that it is possible to gauge improvements in the competence of a learning system based solely on its response to the raw data, without appealing to an external measurement such as its performance on a labeled learning task or its suitability for use in a practical application. Sophie can justifiably claim proficiency at Japanese without taking a quiz that requires her to categorize sentences by topic or identify the part of speech of a particular word. Analogously, it should be possible to demonstrate the competence of a learning system without requiring it to solve a sentence categorization task or a part-of-speech tagging problem.

Sophie’s self-confidence is easily accounted for by the comperical philosophy. The philosophy equates the competence of a system with the compression rate it achieves on a large corpus of raw data. The system can gauge its own performance improvement because it can measure the encoded file size. A human language learner adopts a mindset analogous to the Circularity Commitment of empirical science: the value of a new idea or insight depends solely on the extent to which the idea is useful in producing a better description of the language. Furthermore, the belief that substantial open-ended learning will result in good performance on quizzes or other applications is a simple restatement of the Reusability Hypothesis. A determined search for a good theory of English must, for example, discover significant knowledge relating to part-of-speech tagging, since such information is very important in describing the structure of text. The point of these arguments is not to claim that the brain is actually performing data compression, although there is a theory of neuroscience that claims just that [6, 3]. However, in order to explain the self-confidence effect, whatever quantity the brain is in fact optimizing as it learns must be similar to the compression rate in the sense that it can be defined based on the raw data alone.
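
The self-measurement idea can be sketched in a few lines. This is a toy illustration only (the corpus is invented, and the "stages of understanding" are simulated by progressively better encodings rather than by real learning): at every stage, progress is read off the encoded file size alone, with no external quiz or label set.

```python
import zlib

# Invented corpus standing in for Sophie's pile of raw Japanese text.
corpus = ("watashi wa gakusei desu. " * 2000).encode()

# Stage 0: no understanding -- the text is an opaque stream of bytes.
size_novice = len(corpus)

# Stage 1: partial understanding -- generic statistical regularities
# are captured by an off-the-shelf coder.
size_student = len(zlib.compress(corpus, 9))

# Stage 2: deeper understanding -- the learner has noticed the exact
# structure of the corpus and encodes it as a unit plus a repeat count.
unit = b"watashi wa gakusei desu. "
size_fluent = len(unit) + len(b"2000")

# Each stage is gauged from the encoded size alone -- no quiz, tutor,
# or labeled task is needed to see the improvement.
assert size_fluent < size_student < size_novice
print(size_novice, size_student, size_fluent)
```

The strictly decreasing sizes play the role of Sophie's felt sense of increasing competence: a quantity defined purely on the raw data.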

2.3.4 Direct and Indirect Approaches to Learning

The traditional perspective on the learning problem can be thought of as the “direct” approach. The philosophy of the direct approach can be summed up by the following maxim, due to Vapnik:

When you have a limited data set, attempt to estimate the quantities of interest directly, without building a general model as an intermediate step [116].

The reason one ought not to construct a general model is that such a model is more complex, and thus will run the risk of overfitting when the data is limited. Because of the relationship between model complexity and generalization, practitioners of the direct approach must be fanatical about finding the simplest possible model. Indeed, the most successful modern learning algorithm, the Support Vector Machine, is successful exactly because of its ability to find an especially simple model (called a separating hyperplane) [118]. In the direct approach, aspects of the data other than the quantity of immediate interest are considered unimportant or inaccessible. When developing a face detection system, the researcher of the direct school does not spend much time considering the particular shape of nostrils or the colors of the eyes. As far as it goes, the direct approach cannot be critiqued: Vapnik’s maxim does indicate the surest way to find a specific relationship of interest.
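
The notion of a separating hyperplane can be made concrete in a few lines of code. The sketch below is a hedged illustration using the classic perceptron rule rather than the SVM itself (the SVM additionally maximizes the margin between the classes); the four data points are invented and chosen to be linearly separable.

```python
# Four invented 2-D points, two per class, linearly separable.
points = [((1.0, 2.0), 1), ((2.0, 3.0), 1), ((2.0, 0.5), -1), ((3.0, 1.0), -1)]

# The "especially simple model": a hyperplane w.x + b = 0.
w = [0.0, 0.0]
b = 0.0
for _ in range(100):  # a few passes are plenty for separable data
    for (x1, x2), y in points:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified point
            w[0] += y * x1                         # classic perceptron update
            w[1] += y * x2
            b += y

# Every training point now lies on the correct side of the hyperplane.
for (x1, x2), y in points:
    assert y * (w[0] * x1 + w[1] * x2 + b) > 0
print(w, b)
```

The final model is described by just three numbers, which is exactly the kind of simplicity the direct approach prizes.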

The comperical philosophy of learning suggests an alternative, “indirect” approach. This approach is based on the following maxim, due to Thomas Carlyle:

Go as far as you can see; when you get there you will be able to see farther.

Carlyle’s maxim suggests that instead of focusing on one specific quantity of interest, learning researchers should adopt a mindset of general curiosity. The researcher should go as far as she can see, by learning anything and everything that can be learned on the basis of the available data. In the study of faces, the comperical is happy to learn about noses, eyes, eyebrows, lips, facial hair, eyewear, and so on. In the study of the Japanese language, the comperical is happy to learn how varying levels of politeness can be expressed by verb conjugation, how a single kanji can have multiple pronunciations, how Japanese family names often evoke natural settings, and so on. This mindset is analogous to the Circularity Commitment of empirical science, where a theory is evaluated based on its ability to describe the data (e.g. face images) well, not on its practical utility. At the same time, the comperical endorses the Reusability Hypothesis, which in this context says that a systematic study of the raw data will lead to practical applications. Knowledge of the shapes of eyebrows and nostrils will help to solve the problem of face detection. Knowing how Japanese verbs conjugate will help to solve the problem of part-of-speech tagging.

2.4 Natural Setting of the Learning Problem

2.4.1 Robotics and Machine Learning

The most obvious application of machine learning methods is for the development of autonomous robotic systems. The field of robotics has blossomed in recent years, and there are now many groups across the world working to construct various types of robots. However, the interaction between the field of machine learning and the field of robotics is not as productive as one might imagine. In an ideal world, the relationship between roboticists and learning researchers would mirror the relationship between physicists and mathematicians. The latter group works to develop a toolbox of theory and mathematical machinery. The former group then selects the appropriate tool from the large array of available choices. Just as Einstein adapted Riemannian geometry for use in the development of general relativity, roboticists should be able to select, from an array of techniques in machine learning, one or two methods that can then be applied to facilitate the development of autonomous systems.

Unfortunately, in practice, the theory offered by machine learning does not align with the needs of roboticists. Roboticists often attempt to use machine learning techniques in spite of this mismatch, usually with disappointing results. In other cases roboticists simply ignore ideas in machine learning and develop their own methods. The Roomba, one of the few successful applications of robotics, employs simple, low-level algorithms that produce spiraling, wall-following, or even random walk motion patterns. Other commercial applications of robotics are not autonomous, but instead require a human user to control the system remotely.

Part of the reason for the disconnect between the two fields is that robots must take actions, and the problem of action selection cannot be easily transformed into the canonical form of supervised learning. Another subfield of machine learning, called reinforcement learning, is more applicable to the problem of action selection. This subfield is also hindered by deep conceptual problems, as discussed below.

2.4.2 Reinforcement Learning

All humans are intimately familiar with the opposing qualia called pleasure and pain. These sensations play a crucial role in guiding our behavior. A painful stimulus produces a change in behavior that will make it less likely for the experience to be repeated. After being stung by a bee, one learns to avoid beehives. Conversely a pleasant stimulus produces a behavior shift that makes it more likely to experience the sensation again in the future. The psychological study of how behavior changes in response to pleasure and pain is called behaviorism. The most radical theories developed in this area, associated with the psychologist B.F. Skinner, assert that all brain function can be understood in terms of reinforcement [107].

Critics of this perspective promote a more nuanced view. While there can be little doubt that reinforcement learning plays a role, many aspects of human behavior cannot be accounted for by behavioral principles alone. One such aspect, pointed out by Noam Chomsky in a famous critique of the behaviorist theory, is language learning [23]. Parents neither punish their children for making grammatical mistakes, nor reward them for speaking properly. In spite of this, children are able to learn very sophisticated grammatical rules, suggesting that humans possess a learning mechanism that does not depend on the reward signal.

While few psychologists accept behaviorism as a full explanation of human intelligence, the idea has recently taken on a new significance to researchers in a subfield of machine learning called Reinforcement Learning (RL). These researchers attempt to use the reward principle as a tool to guide the behavior of intelligent systems. The basic idea is simple. One first defines a reward function for a system that produces positive values when the system performs as desired, and negative values when the system fails. One then applies a learning algorithm that iteratively modifies a controller for the system, until it finds a version of the controller that succeeds in obtaining a large reward. There are many such algorithms, but they are mostly similar to the simplified algorithm mentioned in Section 2.1.2, where the objective function is the reward, the model is the controller, and the controller’s parameters are iteratively refined to maximize reward. If the algorithm works, then the researcher obtains a good controller for the system of interest, without needing to do very much work.
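
The simplified reward-maximization loop can be sketched directly. The illustration below is a toy (the target settings, reward function, and step sizes are all invented, and real RL algorithms are considerably more sophisticated): a controller's parameters are perturbed at random, and a perturbation is kept only if it increases the reward.

```python
import random

# Invented controller-tuning sketch of the simplified RL loop.
rng = random.Random(3)
TARGET = [0.2, -0.5, 0.9]  # hypothetical "ideal" controller settings

def reward(params):
    # Reward peaks at 0 when the controller matches the ideal settings.
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))

params = [0.0, 0.0, 0.0]
best = reward(params)
for _ in range(5000):
    i = rng.randrange(3)                # perturb one parameter at random
    candidate = list(params)
    candidate[i] += rng.uniform(-0.1, 0.1)
    r = reward(candidate)
    if r > best:                        # keep only reward-increasing changes
        params, best = candidate, r

print(best)  # approaches 0, the maximum achievable reward
```

Nothing here requires the experimenter to hand-design the controller: the reward signal alone drives the refinement, which is exactly the appeal of the approach.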

Because of the connection it makes to human psychology, and because of the prospect of automatically obtaining good controllers, the idea of reinforcement learning seems quite attractive. Roboticists often attempt to apply RL algorithms to learn controllers for their robots. Unfortunately, researchers have been frustrated in their ambitions by a variety of deep obstacles. The most obvious issue is that RL algorithms typically represent the world as a number of discrete states. The algorithms work well if the number of states is not too large. However, in real-world applications such as robot control, the system of interest is multidimensional, and the number of states grows exponentially with the number of dimensions. Take for example a humanoid robot with 53 joints. Even if each joint can only take 5 different values, the total number of states is 5^53, which is about 10^37. And this huge number only suffices to describe the configuration of the robot; in practice it is necessary to describe the state of the world as well. This issue is called the Curse of Dimensionality in the RL literature.

While the Curse of Dimensionality is bad enough, the RL approach to learning has another critical conceptual limitation. This is the fact that while the reward signal can theoretically be used to guide behavioral learning (i.e. decision-making), it cannot plausibly be used to drive perceptual learning. Intelligent adaptive behavior cannot possibly be achieved without a good perceptual system. The perceptual system must be able to transform an incoming image into an abstract description containing the concept “charging rhinoceros” in order for the decision-making system to determine that “run away” is the optimal action. The necessity of using a mechanism other than the reinforcement signal to guide perceptual learning is illustrated by the following thought experiment.

2.4.3 Pedro Rabbit and the Poison Leaf

In the woods behind Thomas’ house there lives a rabbit named Pedro. Pedro is a simple creature, who lives entirely in the moment, avoiding pain and seeking pleasure. He spends his days eating grass and leaves, avoiding foxes, and chasing after pretty female rabbits. Though his world is quite narrow, he is very familiar with it. As he wanders, he intently studies the things he observes. He becomes very knowledgeable about the landscape and the other wildlife. He also becomes an expert about foliage, which is very important to him because it constitutes his diet. He notices that plant leaves have certain kinds of characteristics that tend to repeat over and over again. For example, some leaves are needle-shaped, while others are circular or hook-shaped. Some leaf edges are jagged, while others are smooth, and still others have tiny hairs.

One day Pedro goes a bit further afield than usual. As he is grazing, he comes upon a new type of plant that he has never before encountered. He takes a moment to study the shape of its leaves, which are hook-shaped, with wavy indentations on the edges. The veins run up and down the length of the leaf. He has never seen this particular combination before. However, this does not surprise him much, as the forest is large and he often encounters new types of plants. He eats a few of the leaves, and then continues on his way.

A couple of hours later, Pedro becomes intensely sick. He grows feverish and starts to see double. He can barely manage to drag himself back to his burrow before he collapses.

When Pedro wakes up the next day, his entire body is throbbing with pain. Recalling his violent sickness of the day before, he is surprised that he is still alive. He makes a solemn vow to his personal rabbit deity to do whatever it takes to avoid experiencing that kind of awful sickness again. Since the pain seems to radiate outward from his stomach, he realizes that the sickness must have been caused by something he ate. Then he remembers the unusual new plant, and instantly concludes that it was the cause of his present grief. After another day of resting, he starts to get hungry and slowly drags himself up to go in search of food.

Pedro slowly wanders around the forest for a while, until he spies a plant that looks like it will make a good lunch. But just as he is about to bite off a leaf, he stops, petrified with terror. The leaf is green, thin, and grows on a bush - just like the one that poisoned him. He slowly turns around, and finds another leaf. It is also green, thin, and growing from a bush. Pedro realizes he has quite a dilemma. He has no choice but to eat leaves, or else he will starve. But he cannot bear to risk eating another poison leaf.

Thinking over the issue slowly in his rabbit fashion, he thinks that perhaps there is a connection between the appearance of a leaf and whether or not it is poisonous. This might allow him to determine whether a leaf is poisonous without eating it. But even if this were true, how can he possibly recognize the poisonous leaf if he sees it again? He has only seen it once!

Then Pedro recalls the unique appearance of the leaf, and its strange combination of hook shape, wavy edge indentation, and longitudinal venation pattern. By fixing that particular combination of characteristics in his mind, he realizes that he can recognize the poison leaf. With a flood of relief, he goes back to his old lifestyle of leaf-grazing, confident that if he ever encounters the poison leaf again, he will be able to avoid it.

This thought experiment is interesting because it shows a shortcoming of the Reinforcement Learning philosophy. In terms of reinforcement, Pedro has received a long series of minor positive rewards from eating regular leaves, followed by a massive negative reward from eating the poison leaf. Furthermore, his life depends on rapidly learning to avoid the poison leaf in the future: if it takes him multiple tries to learn an avoidance behavior, he will probably not survive. It also presents a dilemma for supervised learning, because Pedro needs to be able to obtain the correct classification rule on the basis of a single positive example.

In order to achieve the necessary avoidance behavior, Pedro must have a sophisticated perceptual understanding of the shapes of leaves. If Pedro saw a leaf merely as a green-colored blob, it would be impossible for him to decide whether or not it was poisonous. It is only because Pedro has a sophisticated perceptual system that can identify various abstractions in the raw sensory data, such as the shape, edge type, and venation pattern of the leaf, that Pedro is able to determine the correct classification rule. The problem is that this sophisticated perceptual system can never be developed on the basis of reward adaptation alone. That is because, in all his previous experience, it was sufficient for the perceptual system to report only the presence of a thin, green-colored blob growing on a bush. The precise appearance of the leaf never mattered in terms of reward, so reward adaptation could never be used to optimize the perceptual system. This implies that in order to learn the perceptual system, Pedro’s brain must have relied on some optimization principle other than reinforcement adaptation.

2.4.4 Simon’s New Hobby

Thomas has a 14 year old son named Simon, who is intelligent and a bit eccentric. Simon has a tendency to become obsessed with various hobbies, focusing on them to the exclusion of all else. Thomas worries about these obsessions, but can’t bring himself to forbid them, and even feels compelled to fund them liberally using his investment banker’s salary. Sophie and Simon are fond of one another, and Thomas often tries to get Sophie to influence Simon in various ways.

A couple of years ago, Simon became obsessed with bird-watching. He bought expensive binoculars, read all the field guides, and went on bird-watching expeditions to nearby state parks. On these expeditions he would find a good observation place and lie in wait for hours, hoping to catch sight of a Grey Heron or a Piratic Flycatcher. At the moment of sighting such a creature, he felt a dizzying burst of exhilaration. Eventually, though, impatience proved to be a stronger motivating force. Simon enjoyed seeing the birds, but the torment of the long wait was simply too much.

Recently, Simon has acquired a new hobby: digital surveillance. Sophie hears bits and pieces about this from Thomas, in the form of terse emails. “Simon has got cameras set up all over the house and in the back yard.” “Now he’s got microphones, too, he’s recording everything we say.” “Sophie, now he’s attached the camera to a remote control car, he’s driving it around and spying on people and looking for birds.” “He’s got some kind of pan-tilt-zoom camera set up and he’s controlling it with his computer.” “He’s got a whole room full of hard drives. He says he has a terabyte of data.” “Sophie, you’ve got to help me here. The kid has a remote control helicopter flying around looking for birds, I’m afraid it’s going to crash into someone’s house.”

Sophie wants to help her brother, but also likes to encourage Simon’s hobbies. On her next visit, she suggests to him that he write a computer program that can automatically recognize a bird from its call. By connecting this program to the treetop microphones he has set up, he could be alerted if an interesting bird was nearby, and go see it without suffering through a long wait. Simon is very excited by this idea and so Sophie buys him a copy of Vapnik’s book “The Nature of Statistical Learning Theory” [116]. She tells him that it is a book about machine learning, and that some of the ideas might be useful in developing the bird call recognition system. Thomas, observing this, expresses some skepticism: Simon can’t possibly understand the generalization theorems, since nobody understands the generalization theorems. Sophie suggests that Thomas might be surprised by the intellectual powers of highly motivated youngsters, and that anyway it is better for Simon to be studying mathematics than crashing his mini-helicopter into houses.

A few days later, though, Sophie receives a telephone call from Simon. “Aunt Sophie,” he says, “this book is useless to me! It says right on the first page that it’s about limited data. I don’t care about limited data problems. I’ve got 15 microphones recording at 16 kilobits per second, 24 hours a day.” Sophie explains to Simon that to use the statistical learning ideas to build a classifier to recognize the bird calls, he will need to label some of the audio clips by hand. So while he might have huge amounts of raw data, the amount of data that is really useful will be limited. “So, let’s say I label 100 ten second audio clips. That’s about 16 minutes, but I have thousands of hours worth of data, what am I supposed to do with it?” Sophie responds that the unlabelled portion cannot really be used. Simon listens to this explanation sullenly. Then he asks how many audio clips he will need to label. Sophie doesn’t know, but suggests labelling a hundred and seeing if that works. Then he asks how he is supposed to find the bird calls in the audio data: that is exactly the problem that the detector is supposed to solve for him. Sophie doesn’t have a good answer for this either. Simon does not seem happy about the situation. “I don’t know about this, Aunt Sophie… it just seems really inefficient, and…”


2.4.5 Foundational Assumptions of Supervised Learning

Supervised learning theorists make a number of important assumptions in their approach to the problem. By examining these assumptions, one can achieve a better understanding of the mismatch between the tools of machine learning and the needs of robotics researchers. To a great extent, the philosophical commitments of supervised learning are expressed by the first sentence of the great work by Vapnik (emphasis in original):

In this book we consider the learning problem as a problem of finding a desired dependence using a limited number of observations [116].

According to Vapnik, learning is about finding a specific “desired dependence”, and all attention is devoted to discovering it. In practice, the dependence is almost always the relationship between a raw data object such as an image, and a label attached to it. In this view, there is very little scope for using raw data objects that have no attached label. Because all data objects need a label, and it is typically time-consuming and expensive to produce labeled datasets, the number of data samples is fundamentally limited. This restriction on the amount of the available data constitutes the core challenge of learning. Because of the limited amount of data, it is necessary to use extremely simple models to avoid overfitting. A third assumption is embedded in the word “observations”, which in practice implies the data samples are independent and identically distributed (IID). The next section argues that in the natural setting of the learning problem, none of these assumptions are valid.

Other machine learning researchers have proposed different formulations of the learning problem that depend on different assumptions. Focusing specifically on Vapnik’s formulation is a simplification, but for several reasons it is not entirely unreasonable. First, Vapnik’s formulation is very nearly standard; many other approaches are essentially similar. Many machine learning researchers work to extend the VC theory or apply it to new areas [55]. Second, Vapnik’s theory is the basis for the Support Vector Machine (SVM) algorithm, one of the most powerful and popular learning algorithms currently in use. Much research in fields such as computer vision and speech recognition uses the SVM as an “off-the-shelf” component.

2.4.6 Natural Form of Input Data

There are several perspectives from which one could approach the study of learning. A psychologist might study learning to shed light on the function of the human brain. A roboticist might wish to use learning to construct autonomous systems. And a pure observer, like Simon, might be interested in learning as a way to make sense of a corpus of data. These problems are not identical, but they do share several important characteristics. Neither the supervised learning approach nor the reinforcement learning approach is exactly appropriate for these problems. The reason is that those formalisms make very specific assumptions about the form of the input data that is available to the learning system. The basic characteristics of the natural form of the input data are described below.

Data is a Stream

For robots and organisms operating in the real world, and also for pure observers, data arrives in the form of a stream. Each moment brings a new batch of sensory input. Each new batch of data is highly dependent on the previous batch, so any attempt to analyze the data as a set of independent samples is misleading from the outset.

Consider, for example, a camera placed in the jungle. The camera records a 12 hour video sequence in which a tiger is present for five seconds. How much data about the tiger does this represent? If one considers that the appearance of the tiger constitutes a single data sample, then it will be very difficult to learn anything without overfitting, because of the small size of the data set. Perhaps, on the other hand, one notes that the five seconds correspond to 150 image frames. Using this substantially increased figure for the size of the data justifies the use of a much more complex model. But it also causes problems with the traditional formulation, since the 150 frames are certainly not independent: each frame depends strongly on the previous frames. One could take this even further, and note that each image contains 768 subwindows of a given size, so the number of positive data samples grows larger still. The success of any given approach to learning from this data set will depend strongly on the data partitioning scheme employed, but the standard formulation of machine learning gives no instruction about which scheme is correct.

The Stream is Vast

Not only does data arrive in the form of a stream, it is an enormous stream. A cheap digital web camera, available from any electronics or computer store, can easily grab frames at 30 Hz, corresponding to data input rates on the order of megabytes per second. And cameras are not the only source of data: robotic systems can also be equipped with microphones, laser range finders, odometers, gyroscopes, and many other devices. Humans experience a similarly huge and multi-modal deluge of sensory data.

Standard formulations of machine learning do not take into account the vast size of the input data. Indeed, Vapnik’s formulation explicitly assumes that the input data is limited, and that the limitations constitute the core of the challenge of learning. The reinforcement learning formulation is also maladapted to the natural form of the input data. RL theory assumes that observations are drawn from a set of symbols and delivered to the learning system. In principle, the theory works regardless of the form the observations take. But in practice, learning will be disastrously inefficient unless the observations are transformed from raw data into abstract perceptions that can be used for decision making. If the learning system receives an observation labelled “tiger”, it may reasonably conclude that the correct action is “escape”. But an observation delivered merely as a batch of ten million pixels is nearly useless.

Supervision is Scarce

Biological organisms, real-world robots, and observational learning systems all receive data in a vast, multi-modal stream. These entities may use different types of supervision mechanisms to produce adaptive behavior. An organism is guided by a natural reward signal that is determined by its biological makeup. A real robot can be guided by an artificial reward signal that is decided by its developer. In the context of observational learning, the analogue of the reward signal is the set of labels attached to the data. These labels can be used to guide the learning system to reproduce some desired classification behavior. The key point is that in all three cases, the information content of the supervisory data is small compared to the vast sensory input data.

Simple introspection should be sufficient to prove this claim in the context of real organisms. Humans certainly experience moments of intense pleasure and pain, but most of our lives are spent on a hedonic plateau. On this plateau the reward signal is basically constant: one is content and comfortable, neither ecstatic nor miserable. Since the reward signal is roughly constant, very little can be learned from it. At the same time, however, our sensory organs are flooded with complex information, and this data supports extensive learning.

The same thing holds for information received from parents, teachers or supervisors. A child may receive some explicit instruction in language from his mother, but most instruction is purely implicit. The mother does not provide a comparative chart illustrating the differences between grammatical and ungrammatical speech; she only provides many examples of the former. The feedback received by students from their teachers also has low information content. An aspiring young writer could not plausibly plan on mastering the art of writing from the comments made by her professors on her term papers. Children learn difficult motor skills such as running, jumping and hand-eye coordination without any explicit instruction.

In the Reinforcement Learning approach, a natural tendency is to attempt to have robots learn sophisticated behavior by using complex reward functions. Imagine a researcher who has built a humanoid robot and now wishes to train it to fetch coffee in an office environment. The simplest method would be to give the robot a big reward when it arrives with the coffee. This scheme implies, however, that the robot will have no guidance at all until it successfully completes a coffee-fetching mission. Until then, it will explore its state space more or less at random (it may, for example, spend a long time trying to do hand stands). In this case the inclination of the RL researcher is to develop a more complex reward signal. The robot may be given a small reward just for obtaining the coffee, since this is an important first step toward delivering it. But this is still not adequate, since it will require a long time, and many broken cups and much spilled coffee, before the robot ends up with a full cup in its hand as a result of random exploration. One could then go further, defining all kinds of small rewards for each small component of the full task, such as standing, walking, navigating the hallways, grasping the cup, and so on. One would also need to define negative rewards for spilling the coffee or bumping into people.

This complex reward strategy may work, but it is unattractive for several reasons. First, it requires enormous effort on the part of the researcher, and it nearly defeats the entire point of machine learning, which is for computers to discover solutions on their own. Second, the resulting system will be extremely brittle: if the researcher decides to change course in mid-development and teach the robot to mow lawns instead of fetching coffee, a huge amount of the previous work will have to be discarded. Third, it still does not address the problem of learning perception.

2.4.7 Natural Output of Learning Process

Dimensional Analysis

In physics there is a fascinating technique called dimensional analysis, which allows one to find the rough outline of a solution without doing any work at all. Dimensional analysis relies on the fact that most quantities used in physics have units, such as kilograms, seconds, or meters. Dimensional analysis provides a nice way of checking for obvious mistakes in a calculation. Say the quantity x is measured in meters, and the quantity t is measured in seconds. Now, if the result of a calculation involves a term of the form x + t, then there is obviously a mistake in the calculation, since it is nonsense to add meters to seconds.
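This unit bookkeeping is mechanical enough to automate. The sketch below (the `Quantity` class is invented purely for illustration) tracks exponents of meters, kilograms, and seconds, refuses to add mismatched units, and reproduces the √(m/k) combination for a mass on a spring:

```python
# Minimal dimensional-analysis sketch. A Quantity carries exponents for
# (meters, kilograms, seconds); adding mismatched units raises an error.
class Quantity:
    def __init__(self, value, m=0.0, kg=0.0, s=0.0):
        self.value = value
        self.units = (m, kg, s)

    def __add__(self, other):
        if self.units != other.units:
            raise ValueError("cannot add quantities with different units")
        return Quantity(self.value + other.value, *self.units)

    def __truediv__(self, other):
        return Quantity(self.value / other.value,
                        *(a - b for a, b in zip(self.units, other.units)))

    def sqrt(self):
        return Quantity(self.value ** 0.5, *(e / 2 for e in self.units))

x = Quantity(3.0, m=1)  # 3 meters
t = Quantity(2.0, s=1)  # 2 seconds; evaluating x + t raises ValueError

mass = Quantity(0.5, kg=1)       # ball mass, in kilograms
k = Quantity(200.0, kg=1, s=-2)  # spring constant, in kg/s^2 (i.e. N/m)
T = (mass / k).sqrt()            # the units work out to plain seconds
```

Running the check confirms that √(m/k) carries units of seconds, exactly as the period must.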

Dimensional analysis actually permits even more powerful inferences. Consider the standard problem where a ball oscillates due to the influence of a spring. There are no other forces, and the system obeys Hooke’s law, F = -kx, where F is the force from the spring and x is the displacement of the ball. The problem contains two parameters: m, the mass of the ball, measured in kg, and k, the spring constant, measured in kg/s². The goal is to find T, the period of the oscillation, which is measured in s. There is only one way to combine m and k to get a quantity that has the appropriate units: T ∼ √(m/k). That is in fact the right answer, up to a constant factor (the actual answer is T = 2π√(m/k)). So dimensional analysis tells us the form of the answer, without requiring any actual calculation at all.

Predicting the Stream

Dimensional analysis suggests that the correct form of an answer can be deduced merely from the inputs and outputs of a problem. In the natural setting of the learning problem, the input is a vast, multimodal data stream with minimal supervision. The output is some computational tool that can be used to produce adaptive behavior. There seems to be one natural answer: the learning algorithm should attempt to predict the stream.

The problem of prediction is extremely rich, in the sense that a wide array of potential methods can be used to contribute to a solution. The most basic prediction scheme is trivial: just predict each pixel in an image sequence from its value at the previous timestep. At the other extreme, perfect prediction is impossible: it would require, for example, the ability to predict the outcomes of historical events, since such events interact with day-to-day experience. Between the two extremes exists a wide variety of possible methods, techniques, and theories that will yield varying levels of predictive power.
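The trivial scheme can be written in a few lines. The sketch below (a synthetic "video" invented for illustration, not a real benchmark) shows that even this baseline exploits the temporal dependence of the stream, beating a predictor that ignores time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "video" stream: each 8x8 frame is the previous frame plus a
# little noise, so consecutive frames are highly dependent.
frames = [rng.integers(0, 256, size=(8, 8)).astype(float)]
for _ in range(99):
    frames.append(np.clip(frames[-1] + rng.normal(0.0, 2.0, size=(8, 8)), 0, 255))

# Trivial temporal predictor: each pixel equals its value in the previous frame.
temporal_err = np.mean([np.mean(np.abs(frames[t] - frames[t - 1]))
                        for t in range(1, len(frames))])

# A predictor that ignores temporal structure (predicting each frame's mean
# intensity for every pixel) does far worse on this stream.
static_err = np.mean([np.mean(np.abs(frames[t] - frames[t].mean()))
                      for t in range(1, len(frames))])
```

On this stream the previous-frame predictor's error is a small fraction of the static predictor's, which is the sense in which even trivial temporal prediction captures real structure.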

The relationship between compression and prediction has already been established. Because of this connection, the quality of a prediction system can be evaluated by measuring the compression rate it achieves on a data stream produced by the phenomenon of interest. The comperical approach is appropriate for all three types of learning described above. Simon can conduct a comperical investigation of the audio data produced by his treetop microphones. Such an investigation could lead to a method for recognizing a bird species from its song. A biological agent could conduct a CRM-like investigation based on the rich sensory data it observes. In the case of Pedro Rabbit, such an investigation could produce a sophisticated understanding of the appearance of plant leaves, before such an understanding becomes critical for survival. An autonomous robot could, in principle, be programmed to conduct a CRM-style investigation of the data it observes. In practice, however, it is probably more reasonable for a group of comperical researchers to develop visual systems ahead of time, and then port the resulting systems onto the robot when it is built.
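The connection can be made concrete with the standard identity that an ideal coder spends −log₂ p(x) bits to encode an outcome the model assigned probability p(x), so the total code length of a stream is the model's cumulative log loss. A minimal sketch on an invented binary stream:

```python
import math

# A binary stream with a strong regularity: ones appear 90% of the time.
stream = ([1] * 9 + [0]) * 50  # 500 symbols

def code_length_bits(stream, p_one):
    # An ideal coder spends -log2 p(symbol) bits on each symbol.
    return sum(-math.log2(p_one if s == 1 else 1.0 - p_one) for s in stream)

naive = code_length_bits(stream, 0.5)     # model blind to the regularity
informed = code_length_bits(stream, 0.9)  # model that predicts the regularity
```

The naive model spends exactly one bit per symbol (500 bits total), while the model that has discovered the regularity spends far fewer: better prediction directly means better compression.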

2.4.8 Synthesis: Dual System View of Brain

The previous discussion expressed some skepticism related to the ability of the reinforcement learning principle to provide a full account of intelligence. However, it would be foolish to completely dismiss the significance of the reward signal; humans and other organisms certainly do change their behavior in response to pleasure and pain. It is easy to reconcile the reinforcement learning approach with the comperical approach by noting that the ability to predict is extremely useful for maximizing reward. A powerful prediction system can be connected to a simple decision-making algorithm to produce a reward-maximizing agent. To see this, consider an agent that has experienced a large amount of data with visual, auditory, and reward components. The sensory data is also interleaved with a record of the actions taken by the agent. Assume that the agent’s learning process has obtained a very good predictive model of the data. Then the following simple algorithm can be applied:

  • Create a set of proposal plans.

  • For each plan use the predictive model to estimate the reward resulting from executing it in the current state.

  • Execute the plan corresponding to the highest predicted reward.
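A minimal sketch of this loop (the reward model below is a hand-coded stand-in for the learned predictive model, and the proposal plans are simply enumerated by hand):

```python
# Sketch of the decision-making loop above. In the text, predicted_reward
# would come from the learned predictive model; here it is a stand-in that
# favors plans whose summed effect lands near an invented goal value of 10.
def predicted_reward(state, plan):
    return -abs(state + sum(plan) - 10)

def choose_plan(state, proposal_plans):
    # Execute the plan with the highest predicted reward.
    return max(proposal_plans, key=lambda plan: predicted_reward(state, plan))

# 1. Create a set of proposal plans.
proposals = [[1, 1, 1], [5, 5], [2, -3], [9]]
# 2-3. Score each plan with the model, then pick the best.
best = choose_plan(0, proposals)
```

All of the intelligence lives in `predicted_reward`; the selection step itself is a one-line argmax, which is exactly the point of the argument.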

This decision-making algorithm is not very elegant. It is also incomplete, since it does not specify a technique for creating the list of proposal plans. Moreover, the strategy may fail badly if the reward prediction function is inaccurate. However, these limitations are not entirely discouraging, because, introspectively, humans seem to be very poor planners. Humans are often paralyzed with indecision when presented with difficult choices, and frequently make decisions simply by imitating others, or by repeating their own past behavior. The fact that humans can act in a reasonably adaptive way without a very sophisticated planning system suggests that naïve methods for constructing the list of proposal plans, combined with the inelegant algorithm given above, may provide comparably acceptable performance.

The above argument implies that prediction is sufficient for obtaining reasonably good reward-maximizing behavior. A moment’s reflection shows that in many cases prediction is also necessary. A system that does not make predictions could, perhaps, be used for very controlled problems such as maximizing the production rate of a paper factory. But real agents need to make predictions to deal with situations they have never before encountered. It is essential to know in advance that jumping off a cliff will be harmful, without actually performing the experiment. This is obvious, but standard reinforcement learning algorithms cannot learn to avoid bad states without visiting them at least once.
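This last point can be made concrete with a toy example (the actions and rewards below are invented for illustration): a tabular learner starts with no preference against the catastrophic action, and only learns to avoid it after trying it at least once.

```python
import random

random.seed(0)

# Toy tabular value learning: two actions, one of them catastrophic.
# Until the agent has actually tried "jump", its value estimate is zero,
# so nothing steers the agent away from the cliff in advance.
q = {"walk": 0.0, "jump": 0.0}
rewards = {"walk": 1.0, "jump": -100.0}

def greedy(q):
    best = max(q.values())
    return random.choice([a for a, v in q.items() if v == best])

history = []
for step in range(10):
    action = greedy(q)
    history.append(action)
    q[action] += 0.5 * (rewards[action] - q[action])  # learning rate 0.5
```

On the first step the agent is indifferent between walking and jumping; only after one (potentially fatal) jump does its value estimate turn negative and keep it away thereafter.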

The more general principle here is the idea that learning can take place without input from a teacher, critic, supervisor, or reward signal. This is true even if the ultimate goal is to please the teacher or maximize the reward signal. The poison leaf thought experiment indicated that in order to survive, Pedro needed extensive knowledge of the shapes of leaves before he consumed the poisoned one. Such extensive knowledge could not be produced by a purely reward-based learning process, since before eating the poison leaf the reward signal was not correlated with the appearance of leaves. The poison leaf thought experiment reiterates the lesson of the Japanese quiz: for some cognitive tasks, good performance is only possible when extensive background learning has already been conducted.

These ideas suggest a new view of how the brain operates. In this view, the brain contains two learning systems. The first is an oracular system, whose goal is to make predictions (particularly action-conditional predictions). To do this, the oracular system learns about a large number of abstractions such as space, time, objects, relationships, and so on. Because it learns from the vast sensory data stream, the oracular system can grow to be quite sophisticated, and is able to obtain highly confident predictions about a wide variety of events. The oracular system learns in an unsupervised manner and attempts to maximize a quantity analogous to the compression rate.

To complement the oracular system, the brain also contains an executive system. The purpose of this system is to exploit the oracular system to produce adaptive behavior. This system uses the predictions and abstractions produced by the oracular system to help in optimizing the reinforcement signal. The executive system learns from the reward signal, which it attempts to maximize. But because the reward signal has relatively low information content, the executive system does not achieve a high level of sophistication. It produces clear signals about how to act only in special situations, such as when confronted with a charging rhinoceros. In other situations it produces only vague inclinations or urges.

This two-module explanation for intelligence is attractive from both an explanatory perspective and an engineering perspective. In terms of explanation, it shows how children can learn language at a high level of proficiency, in spite of the fact that they are not punished for making grammatical mistakes or much rewarded for speaking correctly. It provides a way for Pedro Rabbit to avoid the poison leaf after his initial encounter with it, since the general purpose system has provided him with an advanced perceptual understanding of the shape of leaves. It also accounts for the contrast between the strong confidence with which the brain can make perceptual predictions and the ambivalence with which it makes decisions.

In terms of engineering, the two-component system introduces a useful modular decomposition by splitting the notion of intelligence into two parts. The oracular system performs general purpose, open-ended learning, which produces a variety of high-level abstractions. To a first approximation, the learning process takes into account only the predictive power of the abstractions it considers, not their adaptive significance. In turn, the design of the executive system is considerably simplified by the fact that it can rely on the oracular system to deliver high quality abstractions and predictions. The conceptual simplification introduced by the two-system view can also be thought of as a modular decomposition between the low-level perceptual inputs and the decision making function. In the unsimplified case, the executive system computes an action based on the current state and history of all the perceptual input channels. But this computation is probably intractable, due to the vast amount of input data. The simplification is to introduce a set of intermediate variables corresponding to high-level abstractions. Then the influence of the low-level perceptions on decision making is mediated by the abstraction variables.

3.1 Representative Research in Computer Vision

The following sections contain a discussion and analysis of several representative research papers in computer vision. Most of the papers made strong impacts on subsequent technical work in the field, and some of them represent the current state of the art. But the purpose of presenting the papers is not to provide the reader with a comprehensive technical overview of computer vision methods. Such an overview would require its own book or perhaps bookshelf. Instead, the purpose is to illustrate the principles and philosophical commitments that guide research in the field. Since most vision researchers share the same general mindset, this can be done by describing a few key papers.

3.1.1 Edge Detection

Humans naturally perceive the visual world in terms of edges, lines, and boundaries. The goal of an edge detection algorithm is to replicate this aspect of perception, by marking positions in an image that correspond to edges. Different authors have formalized this goal in various ways, though everyone agrees that edge detection is more than simply finding positions where the image intensity changes rapidly. The consensus seems to be that the detector should identify perceptually salient edges, and should produce detection results that are useful for higher-level applications. The typical approach to the edge detection problem involves a new computational definition of the task, and an attempt to show that this new definition leads to results that agree with human perception or are practically useful.

A simplistic approach to the edge detection problem would be to find the positions in the image where the intensity gradient is high. The limitations of this simple approach can be understood by considering the issue of scale. Many images exhibit sharp intensity variations on a fine scale that are not really important. An example would be an image showing a patch of grass next to a sidewalk. The significant edge in this image is the one separating the grass from the sidewalk. However, the region containing the grass will also contain sharp intensity variations, corresponding to individual blades of grass. The ideal edge detector would ignore these fine scale variations, and report only the significant edge between the grass and sidewalk. The problem is to formulate this idea of ignoring fine-scale variations in computational terms that can be programmed into an algorithm.

One standard way of dealing with the issue of scale is to blur the image as a preprocessing step. Blurring transforms the grass region into a homogeneous patch of greenish-brown pixels. This wipes out the fine scale edges, leaving behind the coarse scale grass-sidewalk edge. Blurring also has the advantage of making the detector robust to noise. Unfortunately, the use of blurring introduces a new set of problems. One such problem is that blurring decreases the accuracy with which the positions of the coarse scale edges are detected. Also, in some cases, it might be necessary to find the fine scale edges, and so throwing them away is the incorrect behavior.

Marr and Hildreth, in a paper boldly entitled “Theory of Edge Detection”, presented an early and influential approach to the problem [73]. The paper contains two parts: the first proposes an algorithm, and the second shows how the algorithm might be implemented by neurons in the brain. The algorithm begins by blurring the image at various scales, for the reasons noted above. The second step searches for places where the image intensity changes rapidly, since these correspond to edge locations. A useful trick for finding these positions is given by the rules of calculus. Rapid intensity changes correspond to extrema in the first derivative of the intensity function. An extremum in the first derivative corresponds to a zero in the second derivative. So the algorithm blurs the image and then searches for zero-crossings of the second derivative.

This approach would be sufficient if an image were one-dimensional. But images are two-dimensional, so when calculating intensity derivatives, one must specify the direction in which the derivative is taken. A key idea of the paper is the realization that, under reasonable conditions, it is acceptable to use the Laplace operator ∇², which is just the sum of the second derivatives in the two coordinate directions. The Laplace operator is invariant under rotations of the coordinate system, so using it will mark edges in the same positions for an image and a rotated copy of that image.
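The blur-then-zero-crossing pipeline can be sketched in a few lines of NumPy. This is an illustrative toy rather than Marr and Hildreth's exact formulation; the function names and the step-image example are my own:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur (the scale-selection step)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, out)

def laplacian(img):
    """Discrete Laplace operator: sum of second derivatives in x and y."""
    lap = np.zeros_like(img)
    lap[1:-1, 1:-1] = (img[:-2, 1:-1] + img[2:, 1:-1] +
                       img[1:-1, :-2] + img[1:-1, 2:] - 4 * img[1:-1, 1:-1])
    return lap

def zero_crossings(lap):
    """Mark pixels where the Laplacian changes sign (candidate edges)."""
    s = np.sign(lap)
    zc = np.zeros(lap.shape, dtype=bool)
    zc[:, 1:] |= s[:, 1:] * s[:, :-1] < 0
    zc[1:, :] |= s[1:, :] * s[:-1, :] < 0
    return zc

# A vertical step edge: the zero crossing appears at the step, not in flat areas.
img = np.zeros((20, 20))
img[:, 10:] = 1.0
edges = zero_crossings(laplacian(gaussian_blur(img, 1.0)))
```

On this synthetic step image, the detector marks the column where the intensity jumps and stays silent in the flat regions, which is exactly the behavior the zero-crossing rule is meant to deliver.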

The paper also presents the idea of a “primal sketch”, which is formed by calculating the edge positions at multiple scales (blurring levels), and combining the responses together into a network of line segments. The authors argue that this representation is complete: the network of segments, along with the amplitude values indicating the intensity of an edge, is sufficient to reconstruct the original image. Since the line segments can be thought of as discrete units, the transformation from the raw image to the primal sketch amounts to a transformation from the analog world of continuous values to the discrete world of symbols. Early artificial intelligence researchers thought intelligence was mostly about symbol processing, so they considered this transformation important because it served as a bridge between raw perception and logical, symbolic reasoning.

The second part of the paper, entitled “Implications for Biology”, attempts to show that the proposed algorithm could actually be computed by neurons in the visual cortex. The paper analyzes known neurophysiological findings, and argues that various patterns of neural activity can be interpreted as performing the computations called for by the algorithm. For example, one specific proposal is that a certain type of neuron, the lateral geniculate X-cell, encodes the result of applying the Laplace operator to the blurred image. The concluding discussion uses the theory to make some neuroscientific predictions.

Another important paper in this area is entitled “A Computational Approach to Edge Detection”, by John Canny [18]. Canny showed how a standard scheme of edge detection could be refined, making it robust to noise and more likely to report perceptually salient edges. The standard way to detect edges or other patterns is to apply a filter to an image, and then look for maxima in the filter’s response. For Canny, the main shortcoming of this scheme is that it is highly sensitive to noise. To Canny and other like-minded researchers, noise arose not just from flaws in the measuring apparatus (the camera), but also from genuinely distracting or uninteresting aspects of an image. The problem of scale can be thought of as a special case of the noise problem: the fine-scale edges of individual blades of grass are treated as “noise” corrupting the homogeneous green of the grass patch. Noise is a problem because it can cause simplistic filters to report false edges, or to produce incorrect estimates for the locations of real edges. Canny wrote down a mathematical expression describing these two sources of error, and then derived a set of optimal filters that minimize the expression. Different types of edges require different filters: for example, one filter is optimal for roof edges (light-dark-light) and another is optimal for step edges (light-dark).

The main idea of the Canny paper is the derivation of the set of optimal filters, but it also describes several additional topics. Canny shows that, contra Marr-Hildreth, there are actually strong advantages to using directionally oriented edge detectors. The paper also describes the problem of searching for edges at multiple scales. The problem now becomes how to integrate the responses from the filters at multiple scales and orientations into one single result; a scheme called feature synthesis is proposed to do this. A third topic mentioned in the paper is a technique for estimating the noise in an image using a method called Wiener filtering. It is very useful to know the noise power, because if there is a lot of noise, then the filter response threshold for identifying a particular position as an edge should be raised. Positions that show filter responses below the threshold are marked as spurious edges and ignored.
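None of Canny's actual optimal filters are reproduced here, but the core loop — filter the image, estimate the noise level, and raise the edge threshold accordingly — can be illustrated with a crude gradient filter. Everything in this sketch is a stand-in of my own devising:

```python
import numpy as np

def gradient_magnitude(img):
    """Central-difference gradient magnitude (a crude stand-in for
    Canny's optimal, direction-tuned edge filters)."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0
    gy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0
    return np.hypot(gx, gy)

def detect_edges(img, k=3.0):
    """Threshold the filter response; the bar is raised when the response
    distribution is noisy, echoing Canny's noise-dependent thresholding."""
    g = gradient_magnitude(img)
    thresh = np.median(g) + k * g.std()   # noisier image -> higher threshold
    return g > thresh

img = np.zeros((20, 20))
img[:, 10:] = 1.0                         # a clean step edge
edges = detect_edges(img)
```

On the clean step image the data-dependent threshold falls below the step response, so the edge columns are reported while the flat regions are not.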

3.1.2 Image Segmentation

Humans tend to perceive images in terms of regions. Most people will perceive an image of a car, a tree, and a pedestrian by packaging the pixels into regions corresponding to those objects.

The goal in the image segmentation task is to imitate this perceptual grouping tendency. An image segmentation algorithm partitions the pixels of an image into non-overlapping regions. Ideally, this partitioning should reflect real object boundaries, so that all the pixels corresponding to a car are packaged together, including tires and windows. Unfortunately, this level of performance has not yet been reached. In practice, most algorithms perform segmentation based on similarities in pixel color, so that the black pixels of the car tire would be packaged into one region, while the windshield pixels would go into another. Arguably, the output of an image segmentation algorithm might also be useful as an input to a higher-level vision system.

One of the most important papers on the topic of image segmentation is “Normalized Cuts and Image Segmentation” by Shi and Malik [104]. This paper proposes to use ideas from the mathematical field of graph theory to approach the segmentation problem. A graph consists of a set of nodes, some of which are connected by edges. A graph is called connected if for any pair of nodes in the graph, there is a path (i.e. a sequence of edges) connecting them. Given a connected graph, a cut is a set of edges which, when removed, renders the graph disconnected. The problem of finding the minimum cut - the cut whose edges have the smallest total weight - is a well-studied problem in graph theory, and efficient algorithms exist to find it.

To use the minimum cut idea to do segmentation, Shi and Malik construct a graph by creating a node for each pixel, and placing an edge between two nodes representing neighboring pixels. Each edge has a weight corresponding to the degree of similarity between two pixels. Then the minimum cut of the resulting graph separates the image into two maximally-dissimilar regions. However, one refinement is necessary. By itself, the minimum cut rule tends to create cuts that isolate single pixels, since typically a pixel can be cut off by removing only four edges. Shi and Malik repair this undesirable behavior by proposing to use a normalized cut:

Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V)

where cut(A, B) is the total weight of the edges separating the two subregions A and B, and assoc(A, V) is the total weight of the edges connecting A to the full graph V. If A is a single pixel, then the ratio cut(A, B)/assoc(A, V) will be large, since every edge included in assoc(A, V) will also be included in cut(A, B). Thus the minimum normalized cut for the graph should correspond to a good boundary separating two distinct regions in the image. Unfortunately, the inclusion of the assoc(A, V) terms makes the problem harder than the basic minimum cut problem. The authors define a relaxed version of the normalized cut problem, where an edge can be fractionally included in a cut, instead of being completely in or out. This relaxed problem can be cast as an eigenvector problem, and the authors give a specialized algorithm for solving it.
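The normalized-cut criterion is easy to evaluate for any candidate partition. The toy sketch below (graph and variable names my own) shows why isolating a single node, which a plain minimum cut favors, scores poorly under the normalized criterion:

```python
import numpy as np

def ncut(W, A):
    """Normalized cut of the partition (A, complement of A) of a graph
    with symmetric weight matrix W. cut(A,B) is the weight crossing the
    partition; assoc(A,V) is the weight from A to the whole graph."""
    B = ~A
    cut = W[np.ix_(A, B)].sum()
    return cut / W[A, :].sum() + cut / W[B, :].sum()

# A 4-node chain: strong ties 0-1 and 2-3, one weak tie 1-2.
W = np.zeros((4, 4))
for i, j, w in [(0, 1, 1.0), (1, 2, 0.1), (2, 3, 1.0)]:
    W[i, j] = W[j, i] = w

balanced = np.array([True, True, False, False])   # cut the weak tie
lone     = np.array([True, False, False, False])  # isolate node 0
```

Cutting the weak tie gives Ncut ≈ 0.095, while isolating node 0 gives Ncut ≈ 1.31, so the normalized score prefers the perceptually sensible two-region split.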

Another, more recent contribution to the image segmentation literature is a paper entitled “Efficient Graph-Based Image Segmentation” by Felzenszwalb and Huttenlocher [34]. These authors also use graph theory concepts to perform segmentation, but formalize the problem in a different way. Again, each pixel in the image corresponds to a graph node, adjacent pixels in the image are linked by edges in the graph, and edges have weights measuring the similarity between pixels. The paper defines two key quantities related to a pair of regions C1 and C2. The first is a region-difference score Dif(C1, C2), which is just the minimum weight edge connecting the two regions. The second key quantity is the following:

MInt(C1, C2) = min(Int(C1) + τ(C1), Int(C2) + τ(C2))

where Int(C) is an internal coherence score of the region C, and τ is an arbitrary auxiliary function. The score Int(C) attempts to quantify the internal similarity of the pixels in a region, and is defined as the maximum weight edge of the minimum spanning tree of the region. The minimum spanning tree of a group of nodes is the minimum-weight set of edges required to make the group connected. If the minimum spanning tree contains a high weight edge, it implies that the region has low internal coherence.

The basic segmentation principle of the paper is that a boundary should exist between two regions if the region-difference score is larger than the paired internal coherence score. Roughly, this means that two regions should be merged only when the combined region is about as internally coherent as the original two regions were. The authors point out that an arbitrary non-negative function of the pixels in a region can be used as the auxiliary term τ without breaking the algorithm. This provides a nice element of flexibility, since τ can be used to prefer regions with a certain shape, color, or texture. This could allow the algorithm to be adapted for the specialized purpose of, for example, detecting faces.

Felzenszwalb and Huttenlocher present an algorithm that can perform segmentation according to the principle defined above, and an analysis of its properties. The algorithm starts by creating a list of all the edges in the graph, ordered so that low weight edges, corresponding to very similar pixel-pairs, come first. The algorithm goes through the list of edges, and for each pixel-pair, determines if the regions containing the two pixels should be merged. This is done by comparing the edge weight to the internal coherence of the two regions containing the two pixels. The authors also prove several theorems, which basically say that the segmentations produced by the algorithm are neither too coarse nor too fine, according to a particular definition of these concepts. Finally, the authors show that the algorithm runs in O(n log n) time, where n is the number of pixels in the image. This is quite fast; potentially fast enough to segment video sequences in real time, something that most other segmentation algorithms are too slow to do.
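A minimal version of the merging loop can be written with a union-find structure. The choice τ(C) = k/|C| is the paper's own suggested auxiliary function; the rest of this sketch, including the one-dimensional "image", is a simplification of my own:

```python
def segment(n_nodes, edges, k=1.0):
    """Graph-based segmentation sketch: process edges from most- to
    least-similar, merging two regions when the connecting edge weight
    does not exceed either region's Int(C) + tau(C), with tau(C) = k/|C|."""
    parent = list(range(n_nodes))
    size = [1] * n_nodes          # |C| for each region root
    internal = [0.0] * n_nodes    # Int(C): max MST edge inside the region

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        if w <= min(internal[ru] + k / size[ru],
                    internal[rv] + k / size[rv]):
            parent[rv] = ru
            size[ru] += size[rv]
            internal[ru] = max(internal[ru], internal[rv], w)
    return [find(i) for i in range(n_nodes)]

# A 1-D "image" [1,1,1,9,9,9]: edge weights are intensity differences.
pixels = [1, 1, 1, 9, 9, 9]
edges = [(abs(pixels[i] - pixels[i + 1]), i, i + 1) for i in range(5)]
labels = segment(6, edges)
```

On this toy input the weak (weight-0) edges inside each plateau are merged first, and the single high-weight edge between the plateaus fails the coherence test, leaving exactly two regions.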

Both of the papers described above present only qualitative evidence to demonstrate the value of their proposals. The experimental validation sections show the results of applying the algorithm to a couple of images. The reader is expected to conclude that the algorithm is good based on a visual inspection of the output. There is no real attempt to empirically evaluate the results of the algorithm, or to compare it to other proposed solutions to the segmentation task.

3.1.3 Stereo Matching

Humans have two eyes located very close to one another. In normal situations, the two eyes observe images that are similar, but measurably different. This difference depends critically on depth, due to the way an image is formed by collapsing a 3D scene onto a 2D plane. If a pen is held at a large distance from a viewer, the pen will appear at nearly the same position in both eye-images, but a pen held an inch from the viewer’s nose will appear at markedly different positions. Thus, the depth of the various objects in a scene can be inferred by analyzing the differences in the stereo image pair. Vision researchers can emulate the conditions of human stereo observation by mounting two cameras on a bar, separated by a known distance.

The crucial difficulty in stereo vision is to find matching points, which are positions in the two images that correspond to the same location in the 3D world. For example, if the 2D positions of the tip of a pen can be found in both images, this is a matching point, since the pen-tip occupies a specific location in the 3D world. From a single matched pair, a disparity (the positional difference between the two image locations) can be calculated, which can then be used to find the 3D depth of the point. Typically, though, the goal is to find a depth map, which represents the depth of all the points in the image. Finding large numbers of matching points is difficult for several reasons. Some points in 3D space might be visible in one eye-image, but occluded in the other image. If the image shows an object with very few distinguishing features, such as a white wall, then it becomes hard to find matching points. Some surfaces are specular, meaning that they reflect light in a mirror-like way, as opposed to a diffuse, incoherent way. Specularity can therefore cause the two cameras to observe very different amounts of light coming from the same point in 3D space.
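For a rectified camera pair, the depth of a matched point follows from similar triangles: Z = f·B/d, where f is the focal length in pixels, B is the baseline between the cameras, and d is the disparity. This standard relation is not spelled out in the text above, but it makes the pen example quantitative:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate the depth of one matched point pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

# With an assumed 700-pixel focal length and a 10 cm baseline:
far_pen  = depth_from_disparity(2.0,   700.0, 0.1)   # small disparity
near_pen = depth_from_disparity(200.0, 700.0, 0.1)   # large disparity
```

A pen held far away shifts by only a couple of pixels between the two images and triangulates to tens of meters, while a pen near the nose shifts by hundreds of pixels and triangulates to a fraction of a meter.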

An early effort in the area of stereo matching is a famous paper by Lucas and Kanade, “An Iterative Image Registration Technique with an Application to Stereo Vision” [68]. This paper begins with a technique for solving a simplified version of the stereo matching problem, called image registration. Here it is assumed that one of the images is a translated and linearly scaled copy of the other, and the task is to find the translation and scaling factor. In principle this could be done by exhaustive search, but that would require a very long time, so the paper gives a more efficient algorithm. The algorithm starts with a guess of the correct translation. Then it approximates the effect of making a small change to the translation by computing the derivative of one image at every point. This linear approximation allows an optimal update to the translation-guess to be computed, and the algorithm then repeats with the new guess. The same basic idea works for finding the scaling factor. Once the scaling factor and the translation vector are known, a straightforward calculation yields the depth of the object.
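The iterative update can be demonstrated on a one-dimensional registration problem. This sketch handles pure translation only (a simplification of my own; the paper also estimates scaling) and uses linear interpolation to shift one signal over the other:

```python
import numpy as np

def register_1d(F, G, x, n_iter=25):
    """Estimate the translation t with F(x + t) ~ G(x) by repeatedly
    linearizing around the current guess (the Lucas-Kanade update)."""
    t = 0.0
    for _ in range(n_iter):
        Ft = np.interp(x + t, x, F)       # F shifted by the current guess
        dF = np.gradient(Ft, x)           # derivative of the shifted signal
        t += (dF * (G - Ft)).sum() / (dF * dF).sum()
    return t

x = np.linspace(-10, 10, 201)
F = np.exp(-x**2 / 8.0)                   # a smooth bump
G = np.exp(-(x + 1.5)**2 / 8.0)           # the same bump, translated
t_est = register_1d(F, G, x)
```

Each pass computes the derivative-weighted residual, which is exactly the optimal small-step update described above; a handful of iterations recovers the true shift of 1.5 to within interpolation error, far faster than an exhaustive search over translations would.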

A more recent paper on the subject is “Stereo Matching Using Belief Propagation” by Sun, Shum, and Zheng [110]. This paper attempts to solve a more challenging version of the matching problem: it attempts to match each pixel with a partner in the other image. The paper formulates the matching task as a probabilistic inference problem, where three probabilistic variables are attached to each pixel. One variable corresponds to the pixel depth, another is true if there is a depth discontinuity at the pixel, and the third is true if the pixel is occluded. The intuition behind the model is that it is very unlikely for one pixel to have a depth of one meter, if the neighboring pixel has a depth of five meters, unless the discontinuity variable is true for the pixel. Also, if a pixel is the same color as its neighbors, it is very unlikely that a discontinuity exists at the given point.

The authors propose to describe the relationship between the depth, occlusion, and discontinuity variables using a probabilistic model called a Markov Random Field (MRF). The MRF assigns a probability to a configuration of variables based on purely local functions, so that, for example, the probability of a certain pixel depth value depends only on the depth values of neighboring pixels. The MRF model thus defines a relationship between the observed variables (the actual pixel values) and the hidden variables (depth, etc). The problem is now in a general form: find the highest-probability configuration of some hidden variables given evidence from the observed variables. There are many techniques for approaching this general problem. An algorithm called Belief Propagation can solve the problem efficiently, but it is only guaranteed to work for a restricted class of simplified models. The authors propose to use Belief Propagation anyway, and accept that the resulting solution will be only an approximation.
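On a one-dimensional chain of pixels (a single scanline), the MRF has no loops and Belief Propagation is exact, reducing to dynamic programming. The sketch below uses the min-sum form with a simple |label − observation| data cost and a constant switching penalty; these potentials are stand-ins of my own, not the paper's actual model:

```python
def map_depths(obs, n_labels, lam):
    """Exact min-sum inference on a chain MRF: unary cost |l - obs[i]|,
    pairwise cost lam whenever neighboring labels differ."""
    def pair(a, b):
        return 0.0 if a == b else lam

    cost = [[abs(l - obs[0]) for l in range(n_labels)]]
    back = []
    for o in obs[1:]:
        prev = cost[-1]
        row, ptr = [], []
        for l in range(n_labels):
            best = min(range(n_labels), key=lambda p: prev[p] + pair(p, l))
            row.append(prev[best] + pair(best, l) + abs(l - o))
            ptr.append(best)
        cost.append(row)
        back.append(ptr)

    l = min(range(n_labels), key=lambda q: cost[-1][q])
    labels = [l]
    for ptr in reversed(back):
        l = ptr[l]
        labels.append(l)
    return labels[::-1]

# Noisy observations from two depth plateaus; the smoothness term cleans them up.
depths = map_depths([0, 1, 0, 5, 4, 5], n_labels=6, lam=2.0)
```

The smoothness penalty suppresses the isolated noisy readings while still allowing one genuine discontinuity between the plateaus, which is the qualitative behavior the stereo MRF is designed to capture.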

3.1.4 Object Recognition and Face Detection

Object recognition is the problem of identifying objects in an image. In contrast to general-purpose tasks like segmentation and edge detection, the object recognition task is directly practical. Systems that perform face detection - a special case of object recognition - have been deployed in digital cameras. A vision system with the ability to recognize pedestrians, cars, traffic lights, and other objects would be an important component of robotic automobiles.

Most object recognition systems in computer vision follow the same basic recipe. The first ingredient is a database containing many images, each marked with a label designating the type of object the image shows. Researchers either construct these databases themselves, or use a shared benchmark. The second ingredient is a standard learning algorithm such as AdaBoost or the Support Vector Machine [37, 10]. Unfortunately, these classifiers cannot be applied directly to the images, because images are far too high-dimensional. The images must be preprocessed somehow in order to be used as inputs to the learning algorithms. The search for smart preprocessing techniques constitutes one of the main areas of research in the subfield. One common strategy is to define features, which are simple functions of an image or an image region. The feature vector calculated from an image can be thought of as a statistical signature of the image. Hopefully, if a good set of features can be found, this signature will contain enough information to recognize the object, while using only a small number of values (say, 50).

Compared to general purpose tasks like image segmentation, performance on the object recognition task is easy to evaluate: look at the error rate an algorithm achieves on the given database. Often, a shared database is used, so it is possible to directly compare a new algorithm’s performance to previously published results. The question of whether these quantitative results are in fact meaningful is taken up at greater length in Section 3.2.

An important recent paper in the face detection literature is entitled “Robust Real-Time Face Detection” by Viola and Jones [119]. The authors use the basic formula for object recognition mentioned above, with the well-known AdaBoost algorithm as a classifier [37]. AdaBoost is a way of combining many weak classifiers, each of which individually performs only slightly better than random, into one strong classifier that yields high performance. A key innovation of the paper is the use of a special set of rectangular filter functions as features, along with a computational trick called the Integral Image. The Integral Image allows for extremely rapid computation of the rectangular features. This is important because the goal of the paper is not just to determine whether an image contains a face, but also where in the image the face is. This means that the algorithm needs to scan every subwindow to determine if it contains a face or not. Because there are many subwindows, it is essential that each subwindow be processed quickly. Another performance-enhancing trick is the attentional cascade, a modification of the AdaBoost classifier in which, if a subwindow looks unpromising after the initial weak classifiers are applied, further analysis is abandoned.
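The Integral Image trick itself is simple enough to show directly: after one cumulative-sum pass, any rectangle sum costs only four array lookups, which is what makes scanning every subwindow affordable. The two-rectangle feature at the end is a generic illustration, not one of Viola and Jones's exact filters:

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; an extra row and column of zeros
    keeps the lookup formula branch-free at the borders."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of img[y:y+h, x:x+w] using four lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

img = np.arange(24, dtype=float).reshape(4, 6)
ii = integral_image(img)

# A two-rectangle feature: top half minus bottom half of a subwindow.
feature = rect_sum(ii, 0, 2, 2, 3) - rect_sum(ii, 2, 2, 2, 3)
```

Because `rect_sum` is constant-time regardless of rectangle size, the same precomputed table serves every subwindow and every feature scale, which is the source of the detector's speed.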

A more recent paper on object recognition is “SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition” by Zhang, Berg, Maire, and Malik [126]. This paper combines two methods: k-Nearest Neighbor (KNN) and the Support Vector Machine (SVM). KNN works as follows: given a new point to be classified, it finds the k samples in the training data that are “closest” to the new point, according to some distance function. To find a guess for the category of the new point, each neighbor “votes” for its own label. In spite of (or because of) this simplicity, KNN works well in practice, and offers several attractive properties. KNN works without any modification for multiclass problems (i.e. where there are many possible labels), and its performance approaches the optimum as the number of samples in the training data increases.

The Support Vector Machine is based on the VC-theory discussed in Chapter 2, and is one of the most powerful modern learning algorithms [118]. The SVM works by projecting the training data points into a very high-dimensional feature space, and then finding a hyperplane that separates the points corresponding to different categories with the largest possible margin. Importantly, the hyperplane computed in this way will be determined by a relatively small number of critical training samples; these are called the support vectors. Because the model is determined by the small set of support vectors, it is simple and so, for reasons discussed in Chapter 2, has good generalization power. However, there are a couple of drawbacks involved in using the SVM for object recognition. First, the SVM is designed to solve the binary classification problem, and so applying it to the multiclass problem requires some awkward modifications. Second, the time required for training an SVM (the search for the optimal hyperplane) scales badly with the number of training samples.

So the idea of the paper is to combine the two classifier schemes. To classify a new point, the first step is to find the nearest neighbors in the training data, just as with standard KNN. But instead of using those points to vote for a label, they are used to train an SVM classifier. This classifier is used only for the given query point; to classify the next query point, a new set of neighbors is found and a new SVM is trained. This hybrid scheme combines the good scaling and multiclass properties of KNN with the strong generalization power of the SVM.
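A numpy-only sketch conveys the control flow of the hybrid. Here the local classifier is a linear SVM trained by hinge-loss subgradient descent, a stand-in for the kernelized SVMs used in the paper; all function names and data are my own:

```python
import numpy as np

def svm_knn_predict(X, y, query, k=5, epochs=100, lam=0.01, lr=0.1):
    """Classify `query` (labels in {-1, +1}) by training a small linear
    SVM on only its k nearest neighbors in the training set."""
    idx = np.argsort(np.linalg.norm(X - query, axis=1))[:k]
    Xn, yn = X[idx], y[idx]
    if len(set(yn)) == 1:
        return int(yn[0])              # unanimous neighbors: no SVM needed
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):            # hinge-loss subgradient descent
        for xi, yi in zip(Xn, yn):
            if yi * (xi @ w + b) < 1:
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:
                w -= lr * lam * w
    return 1 if query @ w + b >= 0 else -1

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])
```

Note that a fresh classifier is fit per query, exactly as the hybrid scheme prescribes: the expensive SVM training only ever sees k points, so it stays cheap no matter how large the training set grows.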

The main missing ingredient to complete the algorithm is to define a distance function for the KNN lookup. A distance function simply takes two arguments (typically images), and gives a value indicating how close together they are. The paper discusses several types of distance functions that can be used for various classification problems.

3.2 Evaluation Methodologies in Computer Vision

A defining feature of the hard sciences such as mathematics, physics, and engineering is that they contain rigorous and objective procedures by which to validate new results. A new mathematical theorem is justified by its proof, a new physical theory is justified by the empirical predictions it makes, and a new engineering device is justified by the demonstration that it works as advertised. Because these fields possess rigorous evaluation methods, they are able to make systematic progress. Given the importance of rigorous evaluation, it is worth inquiring into the methods used in computer vision. Under a superficial examination, these methods may appear to be quite rigorous. Certainly they are quantitative, and in some cases seem to provide a clear basis for preferring one method over another. However, a more critical analysis reveals that evaluation is one of the central conceptual difficulties in computer vision.

Since each computer vision task requires its own evaluation procedure (or “evaluator”) to judge candidate solutions, and some have more than one, there are at least as many evaluators as there are tasks. A comprehensive survey is therefore beyond the scope of this book, which instead provides a discussion that should illustrate the basic issues. Before proceeding to the discussion, it is worth noting that, historically at least, the field has exhibited very bad habits regarding the issue of empirical evaluation. Shin et al. note that, of 23 edge detection papers that were published in four journals between 1992 and 1998, not a single one gave results using ground truth data from real images [105]. Even when empirical evaluations are carried out, it is often by a research group that has developed a new technique and is interested in highlighting its performance relative to existing methods.

3.2.1 Evaluation of Image Segmentation Algorithms

The subfield of image segmentation dates back to at least 1978 [85]. In spite of this long history, vision researchers have never agreed on a precise computational definition of the task. Nor have they agreed on whether the goal is to emulate a human perceptual process, or to serve as a low-level component for a high-level application. There is also a glaring conceptual difficulty in the task definition, which is that there clearly exist some images that are impossible or meaningless to segment. Given an image of the New York skyline, should one segment the buildings all into one region, or assign each building to its own region? If there is a cloud in the sky, should it be segmented into its own region, or is it conceptually contained within one generic “sky” region? While these kinds of questions linger, there are surely some segmentation questions that can be definitively answered. Shi and Malik show an image of a baseball player diving through the air to make a catch. It seems obvious that the pixels corresponding to the player’s body should be segmented into a separate region from the green grass and the outfield wall. Authors of vision papers present these kinds of test images, along with the algorithm-generated segmentation results, for the reader’s inspection. It is assumed that the reader will confirm the quality of the generated segmentation.

This method is not very rigorous, and everyone is aware of this. A slightly more rigorous methodology involves using automatic evaluation metrics. These metrics, recently surveyed by Zhang et al., perform evaluation by computing various quantities directly from an image and the machine-generated segmentation [127]. For example, some metrics compute a measure of intra-region uniformity, since in good segmentations the regions should be mostly uniform. Another type of metric computes a measure of the inter-region disparity; others score the smoothness of the region boundaries. These methods are attractive because they do not require either human evaluation or human-labeled reference segmentations. However, use of such automatic metrics is conceptually problematic, because there is no special reason to believe a given metric correlates well with human judgment. The justification of these metrics is also somewhat circular, since the algorithms themselves often optimize similar quantities. Furthermore, most of the metrics do not work very well. Zhang et al. report an experiment in which 7 out of 9 metrics gave higher scores to machine-generated segmentations than to human-generated ones more than 80% of the time [127].
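An intra-region uniformity metric of the kind surveyed can be stated in a few lines. The size-weighted variance score below is a generic example of my own construction, not one of the specific metrics from the survey:

```python
import numpy as np

def intra_region_uniformity(img, labels):
    """Size-weighted mean of within-region intensity variance;
    lower scores mean more uniform regions."""
    total = 0.0
    for r in np.unique(labels):
        pix = img[labels == r]
        total += pix.size * pix.var()
    return total / img.size

img = np.array([[1, 1, 4, 4],
                [1, 1, 4, 4]], dtype=float)
good = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])        # split at the real intensity boundary
bad  = np.array([[0, 0, 0, 0],
                 [1, 1, 1, 1]])        # split along rows instead
```

The metric rewards the split along the true boundary and penalizes the row-wise split, but it also illustrates the circularity worry: any segmenter that directly minimizes within-region variance will score well by construction, whether or not its output matches human perception.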

In 2001, a real empirical benchmark appeared in the form of the Berkeley Segmentation Dataset, which contains “ground truth” obtained by asking human test subjects to segment images by hand [76]. While the approach to evaluation associated with this dataset is an important improvement over previous techniques, several issues remain. One obvious problem with this approach is that it is highly labor-intensive and task-specific, so the ratio of effort expended to understanding achieved seems low. A larger issue is that the segmentation problem has no precisely defined correct answer: different humans will produce substantially different responses to the same image. Even this might not be so bad; one can define an aggregate or average score and plausibly hope that using enough data will damp out chance fluctuations that might cause a low quality algorithm to achieve a high score or vice versa. But still another conceptual hurdle must be cleared: given two segmentations, one algorithm- and one human-generated, there is no standard way to score the former by comparing it to the latter. Some scoring functions assign high values to degenerate responses, such as assigning each pixel to its own region, or assigning the entire image to a single region [74]. The question of how to score a segmentation by comparing it to a human-produced result has become a research problem in its own right, resulting in a proliferation of scoring methods [74, 32, 115]. A more technical but still important issue is the problem of parameter choice. Essentially all segmentation algorithms require the choice of at least one, and usually several, parameters, which strongly influence the algorithm’s output. This complicates the evaluation process for obvious reasons.

3.2.2 Evaluation of Edge Detectors

The task of edge detection is conceptually similar to segmentation, and faces many of the same issues when it comes to empirical evaluation. One interesting paper in this area is promisingly titled “A Formal Framework for the Objective Evaluation of Edge Detectors”, by Dougherty and Bowyer [31] (1998). This paper begins with the following remarks:

Despite the fact that edge detection is one of the most-studied problems in the field of computer vision, the field does not have a standard method of evaluating the performance of an edge detector. The current prevailing method of showing a few images and visually comparing subsections of edge images lacks the rigor necessary to find the fine-level performance differences between edge detectors… This lack of objective performance evaluation has resulted in an absence of clear advancement in the “state of the art” of edge detection, or a clear understanding of the relative behavior of different detectors.

To remedy this unfortunate situation, the authors propose a framework for evaluating edge detectors based on manually labeled “ground truth”. To obtain the ground truth, a human studies an image and labels the edge positions by hand. This strategy is labor intensive: the authors note that labeling an image requires about 3-4 hours. Because of the time requirements, the evaluation is done using only 40 images, which are grouped into four categories: Aerial, Medical, Face, and generic Object. The manual labeling strategy is also somewhat subjective, since the goal is not to find all the edges but only the perceptually salient ones.

Dougherty and Bowyer grapple with the issue, common to almost all vision evaluation research, of parameter choice. Most vision algorithms employ several parameters, and their output can depend on the parameter choices in complicated ways. So if an algorithm performs well on a certain test with one parameter setting, but poorly on the same test when a different setting is used, should the algorithm receive a high score or a low score? Dougherty and Bowyer propose to score an algorithm based on its “best-case” performance. They use an iterative sampling procedure where many parameter settings are tested, and the best setting is used to score the method. Measuring performance is also somewhat problematic, since most algorithms can reduce their false positive rate by increasing their false negative rate and vice versa. The authors solve this by using a Receiver Operating Characteristic (ROC) curve, which shows the relationship between false positives and false negatives, and defining the best performance as the one with the smallest area under the curve.

Based on this evaluation scheme, Dougherty and Bowyer rank the performance of six detector algorithms, and conclude that the Canny detector and the Heitger detector are the best. They claim that the ranking is fairly consistent across the image categories. However, it is not clear whether this kind of conclusion can be justified on the basis of such a limited dataset. Do 40 images constitute a meaningful sample from the space of images? Another conceptual issue is the connection between human perception and practical applications. Even if an edge detector’s output agrees with human perception, it may not be useful as a low-level component of a higher-level system.

In addition to these conceptual issues, the evaluation reported by Dougherty and Bowyer appears to have some technical problems, as described in a 2002 paper by Forbes and Draper [36]. These authors point out that the ROC-style evaluator is extremely noisy, because very small changes to the target image can cause the evaluator to assign wildly different scores. This noisiness is caused by the hypersensitivity of the edge detectors to small changes in the target image. Forbes and Draper use a graphics program to generate images, which allows them to probe the detectors’ responses to minor changes in the target image. They mention that one edge detector (the Susan detector) goes from receiving the best score to the worst score when the resolution of the target image is changed from 50-90 pixels to 130-200 pixels. They also highlight one example where a one-pixel change in the resolution of the target image leads to a huge shift in the response of the Canny detector. In general, the Canny detector appears to be highly sensitive to the choice of parameter setting. This suggests that the results of Dougherty and Bowyer showing the superiority of the Canny detector were largely an artifact of the parameter search scheme they employed: if a detector’s response changes dramatically as a function of its parameters, then it is more likely that some parameter setting exists for which it receives a high score.

3.2.3 Evaluation of Object Recognition

In theory, a vision researcher can evaluate an object recognition system using a very straightforward procedure: construct a labeled benchmark database, and count the number of errors made by each candidate solution. In practice, as discussed below, a number of conceptual and practical pitfalls can make evaluation difficult. This section describes two well-known benchmark databases that appeared in 2002-2003. There are more recent databases, but at the time of this writing, it is too early to tell whether they have had a strong effect on the field.

In the context of object recognition, the labeled benchmark database actually plays two critical roles. This is because the majority of object recognition systems are based on learning algorithms. The training portion of the database enables the learning algorithms to fit statistical models to the data. The testing portion is used to actually evaluate the system’s performance.
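The two roles of the database can be sketched in a few lines of code. This is a minimal illustration of the train/test protocol described above, not any particular published system: the classifier, the toy scalar "images", and all names are invented for the example.

```python
# Minimal sketch of benchmark-based evaluation: the training portion fits a
# statistical model, the testing portion is used only to count errors.
# The nearest-mean classifier and scalar "images" are illustrative stand-ins.

def fit_nearest_mean(train):
    """Fit a nearest-mean model: one prototype value per label."""
    sums, counts = {}, {}
    for x, label in train:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def evaluate(model, test):
    """Count classification errors on the held-out testing portion."""
    errors = 0
    for x, label in test:
        guess = min(model, key=lambda lab: abs(x - model[lab]))
        if guess != label:
            errors += 1
    return errors / len(test)

# Toy labeled database with two categories, split into train and test.
database = [(0.1, "non-car"), (0.2, "non-car"), (0.9, "car"), (1.0, "car"),
            (0.15, "non-car"), (0.95, "car")]
train, test = database[:4], database[4:]
model = fit_nearest_mean(train)
print(evaluate(model, test))  # fraction of test images misclassified
```

The pitfalls discussed in this section arise precisely because the protocol itself is this simple: everything depends on whether the database is a meaningful sample of visual reality.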

A prototypical example of a benchmark used in computer vision is the UIUC Image Database for Car Detection [1]. The purpose of this database is to evaluate car detector systems, which are obviously an important component of autonomous cars. The UIUC database contains only two categories: car and non-car. In this sense it is similar to the one used by Viola and Jones to train their face detection system. The developers constructed the database in the standard way: they went outside and took a lot of photos, about half of which included cars. This is somewhat labor intensive: the UIUC database contains only about 1500 images; a database that was twice as big would require about twice as much time to construct. The images are greyscale and quite small.

The Caltech 101 is another influential database in the subfield of object recognition. This database differs from the UIUC one in a number of ways. Caltech 101 was developed to evaluate generic object recognition ability, not a specific detector (such as a car or face detector). The Caltech 101 is so named because it contains 101 categories; this is far more than was considered by previous research, which used at most six [33]. Furthermore, the developers of the Caltech 101 used a unique process to construct it:

The names of 101 categories were generated by flipping through the pages of the [dictionary], picking a subset of categories that were associated with a drawing. [Then] we used Google Image Search engine to collect as many images as possible for each category. Two graduate students … then sorted through each category, mostly getting rid of irrelevant images (e.g. a zebra-patterned shirt for the “zebra” category) [33].

This method of generating an image database is quite clever, because it allows a large number of images to be collected and labeled rapidly. However, it has an important drawback, related to the way images are used in web pages. The Google Image Search engine does not select images based on an analysis of their contents. Instead, it searches the surrounding text for the query term. So if the user searches for “monkey”, the engine first finds pages that include that term, and then extracts and returns any images that those pages include. The issue is that a web page describing monkeys will probably include images in which the monkey is prominently displayed in the central foreground. The monkey will not be peripheral, in shadow, or half-concealed by the leaves of a tree. This implies that the images are actually relatively easy to recognize. Another issue with this procedure is that it selects a somewhat strange set of categories; Caltech 101 includes categories for “brontosaurus”, “euphonium”, and “Garfield”.

While the Caltech 101 has been very influential, it also has some important shortcomings. Several of these were noted in a paper by Ponce et al. [91] (the et al. includes 12 other well-known computer vision researchers). One such shortcoming is that the objects are shown in very typical poses, without occlusion or complex backgrounds. Ponce et al. note:

Even though Caltech 101 is one of the most diverse datasets available today in terms of the amount of inter-class variability it encompasses, it is unfortunately lacking in several important sources of intra-class variability. Namely, most Caltech 101 objects are of uniform size and orientation within their class, and lack rich backgrounds [91].

Also, if an average image is constructed by averaging together all the images in a class, this average image will have a very characteristic appearance. For example, the average image of the “menorah” class is very easily recognizable as a menorah.
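The average-image diagnostic can be sketched directly. This is an illustrative toy, with 2x2 arrays standing in for real images; it shows why aligned, uniformly posed classes yield a recognizable mean while varied classes blur toward a uniform blob.

```python
# Sketch of the "average image" diagnostic: pixelwise mean over a class.
# If the mean image is still recognizable, the class has little variation
# in pose, size, and position. Tiny 2x2 greyscale arrays stand in for images.
import numpy as np

def class_average(images):
    """Pixelwise mean of a stack of same-sized greyscale images."""
    return np.mean(np.stack(images), axis=0)

# Three nearly identical images (bright left column): the average stays sharp.
aligned = [np.array([[0.9, 0.1], [0.9, 0.1]]) + 0.01 * i for i in range(3)]
# Three images with the bright region in different places: the average blurs.
varied = [np.array([[0.9, 0.1], [0.9, 0.1]]),
          np.array([[0.1, 0.9], [0.1, 0.9]]),
          np.array([[0.5, 0.5], [0.5, 0.5]])]

print(class_average(aligned))  # retains strong left/right contrast
print(class_average(varied))   # contrast washes out toward a uniform value
```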

The Caltech 101 is very diverse, but that diversity comes at a price: each image category contains only a small number of examples. Many systems are trained using on the order of 20 or 30 examples per category. Because of this small amount of per-category data, the field is vulnerable to manual overfitting as described in Chapter 2. This is described by Ponce et al.:

The (relatively) long-time public availability of image databases makes it possible for researchers to fine-tune the parameters of their recognition algorithms to improve their performance. Caltech 101 may, like any other dataset, be reaching the end of its useful shelf life [91].

One odd phenomenon related to the Caltech 101 is that several techniques can exhibit excellent classification performance but very poor localization performance: they can guess that the image contains a chair, but don’t know where the chair is. This is partly because, in the Caltech 101 and several other benchmark databases, the identity of the target object correlates highly with the background. The correlation improves the performance of “global” methods that use information extracted from the entire image. To some extent this makes sense, because for example cows are often found in grassy pastures, but a recognizer that exploits this information may fail to recognize a cow in a photograph of an Indian city. Ponce et al. analyze this issue and conclude that:

This illustrates the limitations as evaluation platforms (sic?) of datasets with simple backgrounds, such as CogVis, COIL-100, and to some extent, Caltech 101: Based on the evaluation presented in this section, high performance on these datasets do not necessarily mean high performance on real images with varying backgrounds [91].

CogVis and COIL-100 are two other object recognition datasets used in computer vision [66, 82]. The fact that good benchmark performance does not equate to good real-world performance illustrates a challenging methodological issue for computer vision. It shows that database development, which one might think of as a rote mechanical procedure, is actually quite difficult and risky. Often a benchmark database will be developed and studied, but the results of that study will not lead to improvements in the state of the art. In other words, each benchmark produces a modest chance of allowing the field to take a small step forward. Given the effort required to construct databases, this implies that the aggregate amount of benchmark development work that will be required to achieve real success in computer vision is actually immense.

There are, of course, newer and more sophisticated databases that attempt to repair some of the problems mentioned above. For example, the successor to Caltech 101, the Caltech 256, shows objects in a diverse range of poses and backgrounds, so that the average image is a homogeneous blob. Another set of methods involves web-based games in which human players are asked to label images as part of the game. The games are lightly entertaining, and so can potentially exploit “bored human intelligence” to label a large number of images. At the time of this writing, it is not yet clear whether these methods will provide the necessary impetus for further progress in object recognition.

3.2.4 Evaluation of Stereo Matching and Optical Flow Estimation

This section packages together a discussion of evaluation metrics for stereo matching and optical flow algorithms, because the evaluation schemes are actually conceptually similar. The key property that distinguishes these tasks from others previously considered is that ground truth can be obtained using automatic methods. This ground truth is objective and does not require human input. A clean evaluation can be performed by comparing the ground truth to the algorithm output. Interestingly, there is also a second type of evaluation method that can be used for both the stereo matching and optical flow tasks. This method is based on interpolation, as discussed below.

The methods for obtaining ground truth for these tasks are based on the use of a sophisticated experimental apparatus [4, 101, 102, 112]. For the stereo matching problem, ground truth can be obtained using an apparatus that employs structured light [102]. Here, information about stereo correspondence is inferred from the special patterns of light cast on the scene by a projector. The projector casts several different patterns, such that each pixel can be uniquely identified in both cameras by its particular illumination sequence. For example, if ten patterns are cast, then one pixel might get the code 0010101110 while a nearby pixel gets the code 0010100010. Matching points are found by comparing the pixel codes found in each image. A conceptually similar scheme can be used to obtain ground truth for the optical flow problem. Here, an experimental setup is used in which an object sprinkled with fluorescent paint is moved on a computer-controlled motion stage, while being photographed in both ambient and ultraviolet lighting [4]. Since the motion of the stage is known, the actual motion of the objects in the scene can be computed from the reflection pattern of the ultraviolet light.
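The correspondence step of the structured-light scheme reduces to matching per-pixel codes between the two camera images. The following is a minimal sketch of that matching step only; the pixel coordinates and the dictionary-based representation are invented for illustration.

```python
# Sketch of structured-light correspondence: each projected pattern
# contributes one bit to a per-pixel code; pixels observing the same
# illumination sequence in the left and right images are declared matches.

def match_by_code(left_codes, right_codes):
    """Map each left-image pixel to the right-image pixel sharing its code."""
    right_index = {code: pos for pos, code in right_codes.items()}
    return {pos: right_index[code]
            for pos, code in left_codes.items() if code in right_index}

# Per-pixel illumination sequences observed under ten projected patterns,
# keyed by (row, column) pixel coordinates in each camera image.
left  = {(10, 4): "0010101110", (10, 5): "0010100010"}
right = {(7, 4):  "0010101110", (8, 5):  "0010100010"}
print(match_by_code(left, right))  # {(10, 4): (7, 4), (10, 5): (8, 5)}
```

The horizontal offset between matched pixels then yields the disparity, from which depth is recovered.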

Once the ground truth has been obtained, it is a conceptually simple matter to evaluate a solution by comparing its output to the correct answer. While the evaluation schemes used for these tasks are methodologically superior to others used in computer vision, they still suffer from an important drawback, which is the difficulty of using the experimental apparatus. Setting up the apparatus requires a nontrivial amount of human labor, which implies that only a small number of image sequences are used. A well-known benchmark, hosted on the Middlebury Stereo Vision Page, contains a total of 38 sequences [100]. It is not clear if the general performance of an algorithm on arbitrary images can be well estimated using such a small number of sequences. Additionally, since most vision algorithms include several parameters, it is hard to rule out manual overfitting as the source of any good performance that might be obtained on the benchmark.

For both the optical flow problem and the stereo matching problem, there exists an evaluation scheme that is strikingly simpler than the ground-truth based method described above [4, 101, 112]. The basic idea here is to use interpolation. For the stereo matching problem, the experimentalists use a trinocular camera system: a set of three cameras mounted on a bar at known inter-camera spacings. This system generates an image set $\{I_1, I_2, I_3\}$. The peripheral images $I_1$ and $I_3$ are fed into the stereo matching algorithm. If the algorithm is successful in computing the depth map and occlusion points, then it should also be able to infer the central image $I_2$. Thus the evaluation proceeds simply by comparing the algorithm-inferred image $\hat{I}_2$ to the real image $I_2$. A variety of scores can be used for the comparison; Baker et al. use a gradient-normalized RMS error of the form:

$$E = \left[ \frac{1}{N} \sum_{(x,y)} \frac{\bigl( \hat{I}_2(x,y) - I_2(x,y) \bigr)^2}{\|\nabla I_2(x,y)\|^2 + \epsilon} \right]^{1/2}$$

where $\|\nabla I_2(x,y)\|$ is the magnitude of the image gradient at a certain point, $N$ is the number of pixels, and $\epsilon$ is a small constant that prevents division by zero.
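A score of this general kind, comparing the inferred middle image to the real one with each pixel error normalized by the local gradient magnitude, can be sketched as follows. The exact normalization and the value of epsilon here are illustrative choices, not the published constants.

```python
# Sketch of a gradient-normalized interpolation score: errors at strong
# edges (large gradient) are penalized less than errors in flat regions.
# The epsilon and normalization details are illustrative assumptions.
import numpy as np

def interpolation_error(predicted, actual, eps=1.0):
    """Gradient-normalized RMS difference between predicted and real images."""
    gy, gx = np.gradient(actual.astype(float))        # per-axis gradients
    grad_sq = gx ** 2 + gy ** 2                       # squared gradient magnitude
    diff_sq = (predicted.astype(float) - actual.astype(float)) ** 2
    return float(np.sqrt(np.mean(diff_sq / (grad_sq + eps))))

actual = np.tile(np.arange(5.0), (5, 1))   # simple horizontal ramp image
perfect = actual.copy()                    # a flawless interpolation
noisy = actual + 1.0                       # a uniformly wrong interpolation
print(interpolation_error(perfect, actual))  # 0.0 for a perfect guess
print(interpolation_error(noisy, actual))    # positive for a wrong guess
```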

The interpolation-based scheme for evaluating optical flow algorithms is conceptually identical [4]. The only difference is that now the image sequence is separated in time rather than space. A video camera observes a moving scene, and produces an image sequence $\{I_1, I_2, I_3\}$. Again, the extremal images $I_1$ and $I_3$ are fed to the optical flow estimation algorithm, which then attempts to infer the central image $I_2$. Algorithms are evaluated by comparing the guess $\hat{I}_2$ for the central image to the real image $I_2$.

These interpolation-based evaluation schemes are far simpler than the scheme based on ground truth, because no special apparatus is necessary to collect the ground truth. Section 3.4.3 shows that the interpolation metric is actually equivalent to a certain kind of compression score.

3.3 Critical Analysis of Field

This section contains a brief critical analysis of the field of computer vision. Before beginning, it is worthwhile to identify the conditions that allow such a critique to be reasonable even in principle. If computer vision were a mature field like traditional computer science, a critique of its philosophical foundations would be an utterly worthless exercise. However, the limitations of modern vision systems suggest that there is some deep conceptual obstacle hindering progress. Therefore, the critique should be understood not as a disparagement of previous research, but as an attempt to discern the nature of the obstacle. Furthermore, it would be feckless and immature to complain about the limitations of computer vision without proposing some plan to repair those limitations. This book contains exactly such a plan.

3.3.1 Weakness of Empirical Evaluation

The most obvious failing of computer vision is the weakness of its methods of empirical evaluation. This shortcoming is widely recognized and has been lamented by several authors [43, 54]. Many papers attempt to do empirical evaluation by showing the results of applying an algorithm to a couple of test images. By looking at the system-generated results, the reader is supposed to verify that the algorithm has done a good job. Vision researchers are implicitly arguing that because an algorithm performs well on some small number of test images, it performs well in general. But this argument is clearly flawed. It may be that the researchers hand-picked images on which their algorithm performed well, or got lucky in the selection of those images. Or, more likely, it may mean that the design of the algorithm was tweaked until it produced good results on the test images.

As discussed above, there is a recent trend in computer vision towards the use of benchmark databases, such as the Berkeley Segmentation Dataset and the Caltech 101, for empirical evaluation. These benchmarks suffer from a wide variety of conceptual and practical shortcomings, several of which have already been described. One general issue is the problem of meta-evaluation. There is no guarantee that an evaluation scheme will produce good assessments of the quality of a candidate solution. This implies that it is necessary to conduct a meta-evaluation process, to determine which evaluator produces the best information about the quality of a solution. But then how is the word “best” to be defined? Is it the method that correlates best with human judgment? Or is it the method that produces the best performance when used as a subcomponent of a higher-level system? Is there any reason to believe these measures will agree? The possibility that an evaluator is itself low-quality is made worse by the fact that evaluators often require a huge investment of time and effort to develop. Say a researcher has developed a new evaluation strategy that requires a large database of human-labeled ground truth. If there is a strong possibility that the evaluator will fail, then the risk-reward ratio for the project may become unacceptably high.

Another shortcoming of evaluation in computer vision is related to the idea of Goodhart’s Law [39]. Goodhart’s original formulation of this law is: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” To see the relevance of this idea for computer vision, assume that under normal conditions, a certain evaluator produces a noisy but informative estimate of the quality of a method. So there is an observed statistical regularity (or correlation) between the evaluation score and the real quality of the solution. Then if methods were developed in perfect ignorance of the evaluator, the evaluator would produce a reasonable ranking of the various methods. In reality what happens is that researchers know about the evaluation schemes, and this knowledge guides the development process. Anyone who wants to assert the quality of a new technique will need to show that it performs well according to the evaluator. Since the regularity is now being used for control purposes, Goodhart’s Law suggests that it will collapse.

The weakness of evaluation in computer vision is strongly related to the fact that the field does not conceive of itself as an empirical science. In empirical sciences, researchers eventually obtain the correct description of a particular phenomenon, and then move on to new problems. Instead of conceiving of their work as empirical science, vision researchers see themselves as producing a suite of tools. Each tool has a particular set of conditions or applications for which it works well. The only real way to evaluate a low-level method such as an edge detection algorithm is to connect it to a real-world application and measure the performance of the latter. To say that one algorithm is better than another would be like saying a screwdriver is better than a hammer. In this mindset, the fact that the field produces a river of solutions to various tasks, without any strong way of validating the quality of those solutions, is not a problem.

3.3.2 Ambiguity of Problem Definition and Replication of Effort

A critical reader of the computer vision literature is often struck by the fact that different authors formulate the same problem in very different ways. The problem of edge detection means something very different to Marr and Hildreth than it does to Canny. Marr and Hildreth seem chiefly concerned with the transformation from the continuous world to the symbolic world. Canny, on the other hand, seems primarily concerned with making the edge detection process robust to noise. For Viola and Jones, an essential part of the object detection problem is locating the object in the image by scanning every subwindow; this means the detection step for a single subwindow must be extremely fast. In contrast, Zhang et al. (the SVM-KNN paper) are concerned only with determining an object’s identity, not with locating it. This proliferation of incompatible formulations of the same problems hinders progress for obvious reasons.

The cause of this ambiguity in problem definition is that computer vision has no standard formulation or parsimonious justification. Compare this to the situation in physics. Physicists employ the same basic justification for all of their research: predict the behavior of a certain system or experimental configuration. While this justification is parsimonious, it nevertheless leads to a wide array of research, because there are a huge number of physical systems and experimental configurations. Research in computer vision has no comparable justification. Vision papers are often justified by a large number of incompatible ideas. Introductory sections of vision papers will often include discussions of psychological phenomena such as mirror neurons, Gestalt psychology, neuroscience of the visual cortex, and so on. They will also often include completely orthogonal practical justifications, arguing that certain low-level systems will be useful for later, high-level applications.

The lack of precise problem definitions leads to an enormous replication of effort. A Google Scholar search for papers with the phrase “image segmentation” in the title returns more than 15,000 hits. Hundreds of different techniques have been applied to the problem, including fuzzy methods, expectation-maximization, robust analysis, level sets, watershed transforms, random walks, neural networks, Gibbs random fields, genetic algorithms, cellular automata, and more. This is all in spite of the fact that the image segmentation problem is not well defined and still has no good evaluation scheme. A search for the phrase “edge detection” returns more than 7,000 hits. In principle, this immense proliferation of research is not a priori bad. In physics, any new unexplained phenomenon might elicit a large number of competing explanations. The difference is that physicists can eventually determine which explanation is the best.

One crucial aspect of the success of the field of physics is that physicists are able to build on top of their predecessors’ work. Thus Newton used ideas originally developed by Galileo and Kepler to construct his theory of mechanics. Newtonian mechanics was then used by Ampere and Faraday in their electrical research. Their laws were combined and embellished by Maxwell. Maxwell’s theory of electromagnetism then served as an impetus for the theory of special relativity. Each discovery depends on a set of conceptual predecessors. An implication of this is that physics research becomes increasingly esoteric and difficult for nonspecialists to understand. Computer vision does not work like this; researchers rarely build on top of previous work. One might imagine, for example, that recent work in image segmentation might reuse previously discovered edge detection methods. But this is not true, in general: the Felzenszwalb and Huttenlocher paper, published in 2004, uses no concepts developed by other vision researchers. Their paper can be understood by anyone with a bit of background in computer science. The same general theme is true for the other papers described above. They sometimes depend on sophisticated previous results, but those results come from fields like statistics and machine learning, not from computer vision. So the Sun et al. stereo matching paper (2005) depends on the Markov Random Field model, and the Viola-Jones face detection paper (2004) depends on the AdaBoost algorithm, but neither exploits any previous result in computer vision.

3.3.3 Failure of Decomposition Strategy

Several of the systems discussed in the previous section perform a task that can be described as basic or low-level. Few people would claim that these systems have practical value as isolated programs. Image segmentation, for example, is rarely useful as a standalone application. Rather, the idea is that these low-level systems will eventually function as components of larger, more sophisticated vision applications. With a few exceptions, these higher-level applications have not yet appeared, but it is plausible to believe they will appear once the low-level components have achieved a high level of reliability. So during the modern phase, researchers will develop good algorithms for tasks like image segmentation, edge detection, optical flow estimation, and feature point extraction. Then, in a future era, these algorithms will be packaged together somehow into powerful and practical vision systems that deliver useful functionality.

This strategy, though it may seem plausible in the abstract, is in fact fraught with philosophical difficulty. The issue is that, in advance of any foreknowledge of how the futuristic systems will work, there is no way of knowing what subsystems will be required. Future applications may, plausibly, operate by first applying low-level algorithms to find the important edges and regions of an image, and then performing some advanced analysis on those components. Or they may function in some entirely different way. It is almost as if, by viewing birds, researchers of an earlier age anticipated the arrival of artificial flight, and proposed to pave the way to that application by developing artificial feathers.

3.3.4 Computer Vision is not Empirical Science

Chapter 1 proposed a simple taxonomy of scientific activity involving three categories: mathematics, empirical science, and engineering. Each of these categories produces a different kind of contribution, and each demands a different kind of justification for new research results. The field of computer vision can be classified in the above scheme by analyzing the types of statements it makes and the analytical methods used to validate those statements. Consider the following abbreviated versions of the statements presented in three of the papers mentioned above:

VJ#1: By using rectangular features along with the Integral Image trick, the feature computation process can be sped up dramatically.
VJ#2: The rectangular features, when used with the AdaBoost classifier, produce good face detection results.
SM#1: The formalism leads to a relaxed eigenvector problem, which can be solved using a specialized algorithm.
SM#2: By using the formalism, good segmentation results can be obtained.
SSZ#1: By introducing special hidden variables representing occlusion and discontinuity, and using a Markov Random Field model, the Belief Propagation algorithm can be used to provide approximate solutions to the stereo matching problem.
SSZ#2: The model defined in this way will produce good stereo matching results.

In the above list, the first statement (#1) in each pair is a statement of mathematics. These statements are deductively true and cannot be objected to or falsified. Thus, an important subset of the results produced in the field are mathematical in character. However, most people would agree that computer vision is not simply a branch of mathematics.

The second set of statements in the above list relates somehow to the real world. But these results are best understood as statements of engineering: a particular device (algorithm) performs the task for which it is designed. While these statements do, perhaps, contain assertions about empirical reality, these assertions are never made explicit. Perhaps the Viola-Jones result contains some implication about the visual structure of faces, but the implication is entirely indirect. Furthermore, the Popperian philosophy requires that a theory expose itself to falsification. The vision techniques described above do not do so; no new evidence can appear that will falsify the Shi-Malik approach to image segmentation. If these methods are discarded by future researchers, it will be because some other technique achieved a higher level of performance on the relevant task.

Another key aspect of empirical science mentioned in Chapter 1 is that practitioners adopt a Circularity Commitment to guide their research. Empirical scientists are interested in constructing theories of various phenomena, and using those theories to make predictions about the same phenomena. They feel no special need to ensure that their theories are useful for other purposes. Clearly, vision scientists do not adopt this commitment. Vision scientists never study images out of intrinsic curiosity. Rather, they study images to find ways to develop practical applications such as face detection systems. If an inquiring youngster asks a physicist about the world, the physicist might respond with a long speech involving topics such as atoms, gravitation, conservation laws, quantum superfluidity, and the fact that only one fermion can occupy a given quantum state. But if the youngster puts the same question to a vision scientist, the latter will have very little to say.

The failure of vision scientists to make explicit empirical claims is related to their background and training. Most computer vision researchers have a background in computer science (CS). For this reason, they formulate vision research as the application of the CS mindset to images. To understand this influence, consider the QuickSort algorithm, which is an exemplary piece of CS research. The key innovation behind the research is the design of the algorithm itself. The algorithm is provably correct and works for all input lists. Its quality resides in the fact that, in most cases, it runs more quickly than other sorting methods. Using this kind of research as an exemplar, it is not surprising that computer vision researchers attempt to obtain algorithms for image segmentation or edge detection that work for all images. The influence of CS also explains why vision researchers do not attempt to study the empirical properties of images: the development of QuickSort required no knowledge of the empirical properties of lists.

Two other factors, in addition to the influence of the CS mindset, prevent vision researchers from conducting a systematic study of natural images. First, there is little communal awareness in the field that there exist mechanisms (e.g. the compression principle) that could be used to guide such a study. Second, there is not much reason to believe such a study would actually produce anything of value. In other words, it is not widely apparent that a version of the Reusability Hypothesis would hold for the resulting inquiry. Of course, the nonobviousness of the Reusability Hypothesis was one of the key barriers holding back the advent of empirical science. The argument of this book, then, is that the conceptual obstacle hindering progress in computer vision is simply a reincarnation of one that so long delayed the development of physics and chemistry.

3.3.5 The Elemental Recognizer

Imagine that, perhaps as a result of patronage from an advanced alien race, humanity had come into the possession of computers and digital cameras before the advent of physics and chemistry. Aristotle, in his book “On Generation and Corruption”, wrote that all materials are composed of varying quantities of the four classical elements: air, fire, earth, and water. An ancient Greek vision scientist might quite reasonably propose to build a vision system for the purpose of classifying an object according to its primary and secondary elemental composition.

The system would work as follows. First, the researcher would take many pictures of various everyday objects, such as trees, foods, togas, urns, farm animals, and so on. Then he would enlist the aid of his fellow philosophers, asking them to classify the various objects according to their elemental composition. The other philosophers, having read Aristotle, would presumably be able to do this. They may not agree in all cases, but that should not matter too much, as long as there is a general statistical consistency in the labeling (most people will agree that a tree is made up of earth and water).

Now that this benchmark database is available, the vision philosopher uses it to test prototype implementations of the elemental recognizer system. The philosopher takes the standard approach to building the system. His algorithm consists of the following three steps: preprocessing, feature extraction, and classification. The preprocessing step may consist of thresholding, centering, or filtering the image. The feature extraction step somehow transforms the image from a large chunk of data into a simple vector of perhaps 20 dimensions. Then he applies some standard learning algorithm to find a rule that predicts the elemental composition labels from the feature vectors.
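The three-step pipeline can be sketched concretely. Every step here is an illustrative stand-in (a threshold, row sums as features, a nearest-prototype rule), and the labels and prototypes are invented; the point is only the shape of the pipeline, which would work identically whether the labels were elemental compositions or real categories.

```python
# Sketch of the standard recognition pipeline the philosopher uses:
# preprocessing -> feature extraction -> classification.
# All three steps are toy stand-ins, not a real recognizer.

def preprocess(image):
    """Thresholding as a stand-in preprocessing step."""
    return [[1 if p > 0.5 else 0 for p in row] for row in image]

def extract_features(image):
    """Collapse the image into a tiny feature vector (here: row sums)."""
    return [sum(row) for row in image]

def classify(features, prototypes):
    """Nearest-prototype rule over labeled feature vectors."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: dist(features, prototypes[label]))

# Prototype feature vectors "learned" from the philosophers' labels
# (hand-set here for illustration).
prototypes = {"earth": [2, 2], "water": [0, 0]}
image = [[0.9, 0.8], [0.7, 0.6]]          # a bright 2x2 toy "image"
features = extract_features(preprocess(image))
print(classify(features, prototypes))     # "earth"
```

Note that nothing in this pipeline checks whether the label set corresponds to anything real; it merely learns to reproduce the labeling.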

The research on elemental recognition could be justified on various grounds. Some philosophers may claim that it is important to develop automatic elemental composition recognition systems to facilitate progress in other fields. Some thinkers may view it as a good inspiration for new mathematical problems. Purist philosophers may view it as intrinsically interesting to find out the elemental compositions of objects, while others might regard it as an important low level processing task that will be helpful for higher level tasks.

The point of this thought experiment is that this process will work, perhaps rather well, in spite of the fact that the idea of the four classical elements is not even remotely scientific. The standard approach to visual recognition contains no mechanism that will indicate that the elemental categories are not real. Instead of learning something about reality, the system learns to imitate the murky perceptual process which assigns objects to elemental categories.

This idea sounds ridiculous to a modern observer, but only because he knows that the Aristotelian conception of elemental composition is completely false. How could the ancient Greek vision scientists discover this fact? Is it possible that a vision researcher working alone, with no knowledge of modern chemistry or physics, could articulate a principle by which to determine whether the elemental composition idea is scientific or unscientific? The scientific philosophy of computer vision can evaluate the ability of various methods to identify elemental composition, but it cannot judge the elemental composition idea itself.

3.4 Comperical Formulation of Computer Vision

This book proposes a new way to carry out computer vision research: apply the Compression Rate Method to a large database of natural images. Computer vision, in this view, becomes the systematic empirical study of visual reality. A comperical vision researcher proceeds by studying the image database, developing a theory describing the structure of the images, building this theory into a compressor, and demonstrating that the compressor achieves a short codelength. This formulation provides a parsimonious justification for research: a theorem, technique, or observation is a legitimate topic of computer vision research if it is useful for compressing the image database. The field advances by obtaining increasingly short codes for benchmark databases, and by expanding the size and scope of those databases.
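The accounting behind the method can be made concrete. The following sketch scores a "theory" by its net codelength: the size of the compressed database plus the size of the compressor itself, which is the Occam penalty. The byte-pattern database, the 50 KB compressor size, and the use of zlib as a stand-in for a real image model are all assumptions for the example.

```python
import zlib

def net_codelength(data: bytes, compress, compressor_size_bytes: int) -> int:
    """Codelength under the Compression Rate Method: size of the compressed
    data plus the size of the compressor program itself."""
    return len(compress(data)) + compressor_size_bytes

# A stand-in "image database": about 1 MB of highly regular byte data
database = bytes(range(256)) * 4000

# Theory A: no model at all, store the raw bytes (compressor size zero).
raw = net_codelength(database, lambda d: d, compressor_size_bytes=0)

# Theory B: a generic compressor (zlib as a placeholder for an image model),
# charged a hypothetical 50 KB for its own program length.
modeled = net_codelength(database, lambda d: zlib.compress(d, 9),
                         compressor_size_bytes=50_000)

print(modeled < raw)  # the theory wins only if its savings exceed its own size
```

This makes the tradeoff in the book's abstract explicit: a 10 Mb model is trivially justified against a 100 Gb database, but never against a database smaller than the model itself.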

This approach to vision has a number of conceptual and methodological advantages. It allows for clean and objective comparisons between competing solutions. These decisive comparisons will allow the vision community to conduct an efficient search through the theory-space. It allows researchers to use large unlabeled databases, which are relatively easy to construct, instead of the small labeled datasets used in traditional evaluation procedures. Because of the large quantity of data being modeled, the theories obtained through the CRM can be enormously complex without overfitting the data. Perhaps most importantly, it allows computer vision to become a hard empirical science like physics.

It should be noted that some ingenuity must be exercised in constructing the benchmark databases. Visual reality is immensely complex, and it will be impossible to handle all of this complexity in the early stages of research. Instead, researchers should construct image databases that exhibit a relatively limited amount of variation. Chapter 1 proposed using a roadside video camera to generate a database in which the main source of variation would be the passing automobiles. Many other setups can be imagined that include enough variation to be interesting, but not so much that it becomes impossible to handle.

One immediate objection to the proposal is that, while large scale lossless data compression of natural images may be interesting, it has nothing to do with computer vision. The following arguments counter this objection. The key insight is that most computer vision tasks can be reformulated as specialized compression techniques.
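A first hint of that reformulation can be shown numerically. By the Shannon coding argument, a predictive model that assigns probability p to each observed symbol can code it in about -log2 p bits (an arithmetic coder approaches this bound). So a recognizer that predicts labels or pixel values well is, literally, a compressor. The label set and probabilities below are invented for illustration.

```python
import math

def codelength_bits(probs, outcomes):
    """Shannon codelength of a sequence under a predictive model:
    an arithmetic coder achieves about -log2 p(outcome) bits per symbol."""
    return sum(-math.log2(probs[o]) for o in outcomes)

labels = ["cat"] * 90 + ["dog"] * 10

uniform = {"cat": 0.5, "dog": 0.5}    # a model with no knowledge of the data
informed = {"cat": 0.9, "dog": 0.1}   # a model that captured a real regularity

print(codelength_bits(uniform, labels))   # 100.0 bits
print(codelength_bits(informed, labels))  # ~46.9 bits
```

The model that better matches the empirical statistics of the data earns a shorter code, which is exactly the currency in which the comperical approach pays out.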

3.4.1 Abstract Formulation of Computer Vision

Computer vision is often described as the inverse problem of computer graphics. The typical problem of graphics is to produce, given a scene description $D$ written in some description language $\mathcal{L}$, the image that would be created if a photo were taken of that scene. The goal of computer vision is to perform the reverse process: to obtain a scene description from the raw information contained in the pixels of the image $I$. This goal can be formalized mathematically by writing $I = G(D) + E$, where $G(D)$ is the image constructed by the graphics program and $E$ is a correction image that makes up for any discrepancies. Then the goal is to make the correction image as small as possible:

$$D^* = \arg\min_{D} C(I - G(D))$$

Here $C(\cdot)$ is some cost function which is minimized at the zero image, such as the sum of the squared values of the correction pixels. The problem with this formulation is that it ignores one of the major difficulties of computer vision: the inverse problem is underconstrained, since many possible scene descriptions can produce the same image. So it is usually possible to trivially generate any target image by constructing an arbitrarily complex description $D$. As an example, if one of the primitives of the description language is a sphere, and the sphere primitive has properties giving its color and position relative to the camera, then an arbitrary image can be generated by positioning a tiny sphere of the necessary color at each pixel location. The standard remedy for this underconstraint is regularization [90]. The idea is to introduce a function $R(D)$ that penalizes complex descriptions. Then one optimizes a tradeoff between descriptive accuracy and complexity:

$$D^* = \arg\min_{D} \left[ C(I - G(D)) + \lambda R(D) \right]$$

where the regularization parameter $\lambda$ controls how strongly complex descriptions are penalized. While this formulation works well enough in some cases, it also raises several thorny questions about how the two cost functions should be chosen. If the goal of the process is to obtain descriptions that appear visually “correct” to humans, then presumably considerations of human perception must be taken into account when choosing these functions. In practice, the typical approach is for the practitioner to choose the functions based on taste or intuition, and then show that they lead to qualitatively good results.
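A toy numerical version of the tradeoff makes the sphere-per-pixel pathology visible. The complexity counts (one unit per primitive) and the choice of squared-error cost are assumptions made for the example, not part of any particular vision system.

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.uniform(size=(8, 8))  # the target image I

def objective(rendered, complexity, lam):
    """Tradeoff between descriptive accuracy C(I - G(D)) and complexity R(D)."""
    error = np.sum((target - rendered) ** 2)   # C: sum of squared correction pixels
    return error + lam * complexity            # lam: regularization parameter

# Description A: one tiny "sphere" per pixel -- reproduces I exactly,
# but uses 64 primitives (complexity 64 under our made-up count).
per_pixel = objective(target, complexity=64, lam=1.0)

# Description B: a single uniform patch at the mean intensity (complexity 1).
uniform = objective(np.full_like(target, target.mean()), complexity=1, lam=1.0)

# With lam = 0 the degenerate per-pixel description always wins; with lam = 1
# the simple description can win despite its nonzero correction image.
print(per_pixel, uniform)
```

Everything here hinges on the hand-chosen complexity counts and the value of $\lambda$, which is precisely the arbitrariness the text objects to.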

It turns out that the regularization procedure can be interpreted as a form of Bayesian inference. The idea here is to view the image $I$ as evidence and the description $D$ as a hypothesis explaining the evidence. Then the goal is to find the most probable hypothesis given the evidence: