Manifold Hypothesis

What is the Manifold Hypothesis?

The Manifold Hypothesis states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space.

This hypothesis is better explained in examples, however.

Let's tackle the "embedded manifold" bit first, before we get to how it applies to machine learning and data.

A manifold is really just a technical term that is used to classify spaces of arbitrary dimension. For every whole number there exists a flat space called Euclidean space that has characteristics very similar to the cartesian plane. For example the Pythagorean theorem holds and thus the shortest distance between points is a straight line (in contrast, this is not true on a circle or sphere).  The dimension of a Euclidean space is essentially the number of (independent) degrees of freedom - basically, the number of (orthogonal) directions one can "move" inside the space).  A line has one such dimension, an infinite plane has 2, and an  infinite volume has 3, and so n.  A manifold is essentially a generalization of Euclidean space such that locally (small areas) are approximately the same as Euclidean space but the entire space fails to be have the same properties of Euclidean space when observed in its entirety.  This theoretical framework always mathematicians and other quantitively motivated scientists to describe spaces, like spheres, tori (donut-shaped spaces) and mobius bands, in a precise way and even allows a whole plethora of mathematical machinery, including calculus, to be used in a meaningful way.  The upshot is that now the classes of spaces upon which calculus now makes sense is expanded to include spaces that may be curved in arbitrary ways, or even have holes like the torus. 

So now we take this idea, and apply it to high-dimensional data.   Imagine we are interested in classify all (black and white) mages with mxn pixels.  Each pixel has a numerical value, and each can vary depending on what the image is, which could correspond to anything from an award wining photo to meaningless noise.  The point is that we have mxn degrees of freedom so we can treat an image of mxn pixels as being a single point in living in a space (manifold)  of dimension N =mn,  Now, imagine the set of all mxn imagines that are photos of Einstein.  Clearly we now have some restriction on the choice of values for the pixels if we want the images to be photos of Einstein rather than something else.  Obviously random choices will not generate such images.  Therefore, we expect there to be less freedom of choice and hence:

The manifold hypothesis states that that this subset should actually live in an (ambient) space of lower dimension, in fact a dimension much, much smaller than N

Why This Hypothesis is Important in Artificial Intelligence? 

The Manifold Hypothesis explains (heuristically) why machine learning techniques are able to find useful features and produce accurate predictions from datasets that have a potentially large number of dimensions ( variables).    The fact that the actual data set of interest actually lives on in a space of low dimension, means that a given machine learning model only needs to learn to focus on a few key features of the dataset to make decisions.  However these key features may turn out to be complicated functions of the original variables.  Many of the algorithms behind machine learning techniques focus on ways to determine these (embedding) functions.

MIT has an excellent paper on testing the hypothesis. We also recommend checking out Colah’s blog.