“Deep” unsupervised feature learning methods (Bengio, 2009; Hinton and Salakhutdinov, 2006; Vincent et al., 2008; Tenenbaum et al., 2000; LeCun, 2013) present a challenge to learning theory. This paper takes up that challenge of explaining when and why these techniques work. Is it possible to learn generic features from data in an unsupervised fashion that perform well in a multitude of tasks, and if so, how do we learn these features? Following the work of the statistician Lucien Le Cam (Lecam, 2011, 1964, 1974) and utilizing the techniques of statistical decision theory, in particular the comparison of statistical experiments (Blackwell, 1951, 1953; Torgersen, 1991; Ferguson, 1967), we show (theorem 3) that it is possible to construct generic features from data if and only if one can find an encoder/decoder pair with low probability of reconstruction error. Furthermore, the worst case difference in performance between the best decision rule that uses such features and the best decision rule that uses the raw data is bounded above by the probability of reconstruction error. We also show that we can learn this encoder/decoder pairing in a hierarchical fashion, and that the probability of reconstruction error of such a “stacked” system is bounded by the sum of the probabilities of reconstruction error of each layer.
While our approach is abstract, the ultimate payoff will be a novel inequality (theorem 2) that provides a characterization of when generic feature learning is possible. This inequality, coupled with the concept of deficiency (Lecam, 2011; Torgersen, 1991) (to be explained in the paper), illuminates the algorithms used in deep learning and provides a means to judge the generic quality of the features learnt by such methods.
2 The General Learning Problem
For all of the following, assume that all measure spaces are finite. This does not restrict any of the results; rather, it allows for a cleaner presentation free of measure theoretic technicalities as well as boundedness and existence concerns.
A learning problem is a quintuple $(\Theta, X, e, A, L)$. $\Theta$ is a set of possible “true hypotheses” or unknowns. While we cannot observe the true $\theta \in \Theta$ directly, we can observe data in some set $X$. $e$ is a relationship between the two sets $\Theta$ and $X$ called the experiment. $e(\theta)$ tells us what data we expect to see if $\theta$ is the true hypothesis. Ultimately we are required to make a decision by choosing an action $a$ from a set of actions $A$, and our performance is measured by a loss function $L: \Theta \times A \to \mathbb{R}$. We view the loss as an integral part of a learning problem and as such do not place any restrictions on it other than boundedness. As is usual in statistical learning, for our possible relationships $e$ we use Markov kernels (conditional probability assignments/stochastic matrices).
A Markov kernel is a function $T: X \to P(Y)$, where $P(Y)$ denotes the set of probability distributions on $Y$.
For arbitrary sets $X$ and $Y$, denote by $\mathcal{M}(X, Y)$ the set of all Markov kernels from $X$ to $Y$. As we can represent distributions in $P(Y)$ by vectors in $\mathbb{R}^{|Y|}$ with positive entries, we have $P(Y) \subseteq \mathbb{R}^{|Y|}$. As such, one can represent a Markov kernel as a $|Y| \times |X|$ matrix of positive entries where the sum of all entries in each column is equal to $1$. It is easily verified that $\mathcal{M}(X, Y)$ is a closed convex subset of $\mathbb{R}^{|Y| \times |X|}$, the set of all $|Y| \times |X|$ matrices.
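Since everything here is finite, the matrix view is directly computable. The following sketch (all kernels and names are illustrative, numpy assumed) checks the column-stochastic property and the convexity of $\mathcal{M}(X, Y)$ numerically:

```python
# A sketch of the matrix view of Markov kernels, with finite sets X, Y
# represented by index ranges. All numbers are illustrative.
import numpy as np

def is_markov_kernel(T, atol=1e-9):
    """Check that T is a |Y| x |X| column-stochastic matrix:
    nonnegative entries, each column summing to 1."""
    T = np.asarray(T, dtype=float)
    return bool(np.all(T >= -atol) and np.allclose(T.sum(axis=0), 1.0, atol=atol))

# A kernel from X = {0,1,2} to Y = {0,1}: column x is the distribution T(x).
T = np.array([[0.9, 0.5, 0.1],
              [0.1, 0.5, 0.9]])

# M(X, Y) is convex: any convex combination of kernels is again a kernel.
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
mix = 0.3 * T + 0.7 * U
```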
A function $f: X \to Y$ defines a Markov kernel $T_f$ with $T_f(x) = \delta_{f(x)}$, a point mass distribution on $f(x)$. For every measure space $X$, there are two special Markov kernels: the identity (or completely informative) Markov kernel $\mathbb{1}_X$ from the identity function $x \mapsto x$, and the completely uninformative Markov kernel from the function from $X$ to a one element set.
From a prior distribution $\pi \in P(\Theta)$ and a Markov kernel $e \in \mathcal{M}(\Theta, X)$ we can construct a joint distribution on $\Theta \times X$. Using the matrix vector representation, this is achieved by post multiplying $e$ with a diagonal matrix with $\pi$ on the diagonal, $e \cdot \mathrm{diag}(\pi)$. This is no different to the standard product rule $P(x, \theta) = P(x \mid \theta) P(\theta)$. Given a prior distribution $\pi$ and a Markov kernel $e$, we denote the Markov kernel obtained by Bayes rule by $\bar{e}_\pi \in \mathcal{M}(X, \Theta)$.
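As a small illustrative sketch (all numbers invented), the joint distribution $e \cdot \mathrm{diag}(\pi)$, the data marginal and the Bayes-rule kernel $\bar{e}_\pi$ can be computed directly in the matrix representation:

```python
# A minimal sketch of the joint distribution and Bayes rule for a
# binary experiment; all numbers are illustrative.
import numpy as np

# Experiment e in M(Theta, X): |X| x |Theta| column-stochastic matrix.
e = np.array([[0.8, 0.3],
              [0.2, 0.7]])
pi = np.array([0.5, 0.5])          # prior on Theta

joint = e @ np.diag(pi)            # joint[x, theta] = P(x | theta) P(theta)
marginal = joint.sum(axis=1)       # distribution of the data, e(pi)

# Bayes rule: posterior kernel in M(X, Theta), column x is P(theta | x).
posterior = (joint / marginal[:, None]).T
```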
A learning problem can more compactly be represented as the pair $(e, L)$, as $\Theta$, $X$ and $A$ can be inferred from the type signatures of $e$ and $L$. We measure the size of loss functions by $\lVert L \rVert_\infty = \sup_{\theta, a} |L(\theta, a)|$.
2.1 Decision Rules
Upon observing data we are required to relate $X$ to a set of actions $A$ by some other Markov kernel $r \in \mathcal{M}(X, A)$ known as a (randomized) decision rule. We are judged on the quality of the composed relation $r \circ e \in \mathcal{M}(\Theta, A)$.
Definition (Composition) Suppose $T_1 \in \mathcal{M}(X, Y)$ and $T_2 \in \mathcal{M}(Y, Z)$ are Markov kernels. Then we can compose $T_1$ and $T_2$, yielding $T_2 \circ T_1 \in \mathcal{M}(X, Z)$, by matrix multiplication.

A Markov kernel $T \in \mathcal{M}(X, Y)$ also provides a function $T: P(X) \to P(Y)$ by matrix multiplication. To calculate $T(P)$ we identify $P$ with a vector and $T$ with a matrix and use matrix multiplication. This function is convex linear.
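A minimal sketch of composition and convex linearity in the matrix representation (the kernels here are illustrative):

```python
# Composition of Markov kernels as matrix multiplication, with
# illustrative kernels T1 in M(X, Y) and T2 in M(Y, Z).
import numpy as np

T1 = np.array([[0.9, 0.2],      # |Y| x |X|
               [0.1, 0.8]])
T2 = np.array([[0.7, 0.4],      # |Z| x |Y|
               [0.3, 0.6]])

T = T2 @ T1                     # the composition T2∘T1 in M(X, Z)

# Pushing a distribution P on X through T1 is also matrix multiplication,
# and the map is convex linear: T1(aP + (1-a)Q) = a T1(P) + (1-a) T1(Q).
P = np.array([0.6, 0.4])
Q = np.array([0.1, 0.9])
a = 0.25
lhs = T1 @ (a * P + (1 - a) * Q)
rhs = a * (T1 @ P) + (1 - a) * (T1 @ Q)
```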
2.2 Risk and Value: Ranking Decision Rules and Learning Problems
Given a learning problem $(e, L)$, one can rank decision rules using the full Bayes risk
$$\mathrm{Risk}(\pi, e, r, L) = \mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{x \sim e(\theta)}\, \mathbb{E}_{a \sim r(x)}\, L(\theta, a).$$
Here $\pi \in P(\Theta)$ is a prior distribution on $\Theta$ which reflects which hypotheses we feel are more or less likely to be true. Note that both $\mathcal{M}(\Theta, X)$ and $\mathcal{M}(X, A)$ are convex sets and that $\mathrm{Risk}$ is convex bilinear in $e$ and $r$ (the same as bilinear but restricted to convex combinations). Alternately, taking a supremum over the prior yields the max risk.
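The full Bayes risk is a finite sum, so it can be computed directly from the matrix representation. A sketch with an invented experiment, rule, prior and 0-1 loss:

```python
# A sketch of the full Bayes risk for finite Theta, X, A, using the
# matrix representation throughout. All numbers are illustrative.
import numpy as np

e = np.array([[0.8, 0.3],          # experiment: |X| x |Theta|
              [0.2, 0.7]])
r = np.array([[1.0, 0.2],          # decision rule: |A| x |X|
              [0.0, 0.8]])
pi = np.array([0.5, 0.5])          # prior on Theta
L = np.array([[0.0, 1.0],          # loss matrix: L[theta, a] (0-1 loss)
              [1.0, 0.0]])

# Risk(pi, e, r, L) = sum_{theta, a} pi(theta) (r∘e)(a | theta) L(theta, a)
composed = r @ e                   # r∘e in M(Theta, A)
risk = float(np.einsum('t,at,ta->', pi, composed, L))
```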
Ranking Learning Problems. We also rank the difficulty of learning problems. The greatest challenge in a learning problem comes from the fact that we cannot use an arbitrary decision rule in $\mathcal{M}(\Theta, A)$. Rather, we are restricted to a certain subset of $\mathcal{M}(\Theta, A)$ that “factors through” $e$: we are only allowed to use the data we see.

Definition 2 (Factoring Through)

Suppose we have two Markov kernels $e_1 \in \mathcal{M}(\Theta, X_1)$ and $e_2 \in \mathcal{M}(\Theta, X_2)$. We say that $e_2$ factors through $e_1$ (written $e_2 \trianglelefteq e_1$) if there exists a Markov kernel $T \in \mathcal{M}(X_1, X_2)$ such that $e_2 = T \circ e_1$. Denote by $\mathcal{M}_{e_1}(\Theta, X_2)$ the set of all Markov kernels in $\mathcal{M}(\Theta, X_2)$ that factor through $e_1$.

If $e_2 \trianglelefteq e_1$ then $e_2$ can be thought of as $e_1$ with extra noise $T$. The reader is directed to section 1 of the appendix for more properties of factoring through. In this notation, the rules available to the learner are exactly those in $\mathcal{M}_e(\Theta, A)$.
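A small numerical illustration of factoring through as the addition of noise: composing an informative binary experiment with a noisy channel produces an experiment that factors through it by construction, and whose columns are strictly closer together (all kernels invented):

```python
# A sketch of factoring through: e2 = T∘e1, so e2 is e1 followed by a
# noise kernel T. The kernels are illustrative column-stochastic matrices.
import numpy as np

e1 = np.array([[0.9, 0.1],        # a fairly informative binary experiment
               [0.1, 0.9]])
noise = np.array([[0.8, 0.2],     # a symmetric channel T in M(X1, X2)
                  [0.2, 0.8]])
e2 = noise @ e1                   # e2 factors through e1 by construction

# e2 is strictly noisier: its columns are closer together than e1's,
# measured by the variational (L1) divergence between the columns.
sep1 = np.abs(e1[:, 0] - e1[:, 1]).sum()
sep2 = np.abs(e2[:, 0] - e2[:, 1]).sum()
```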
We assign a value
$$\mathrm{Val}(\pi, e, L) = \inf_{r \in \mathcal{M}(X, A)} \mathrm{Risk}(\pi, e, r, L)$$
to a learning problem, with lower value being better. Taking a supremum of the value over the prior yields the minimax risk. The value is the risk of the best possible decision rule for the learning problem at hand.
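Because the Bayes-optimal rule picks, for each observation, the action minimising posterior expected loss, the value is computable in closed form for finite problems. A sketch (illustrative numbers):

```python
# A sketch computing Val(pi, e, L) for a finite problem: the optimal rule
# picks, per observed x, the action with least posterior expected loss.
import numpy as np

e = np.array([[0.8, 0.3],
              [0.2, 0.7]])
pi = np.array([0.5, 0.5])
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # 0-1 loss, with A = Theta

joint = e @ np.diag(pi)           # joint[x, theta]
# (joint @ L)[x, a] = sum_theta P(x, theta) L(theta, a); minimising over a
# for each x and summing over x gives the value directly.
value = float((joint @ L).min(axis=1).sum())
```

Here the value $0.25$ is the Bayes error of the problem; the completely uninformative experiment would give value $0.5$.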
Bayes Decision Rules are Optimal. If we use $\mathrm{Risk}$ to order decision rules, then the best is found by using Bayes rule,
$$\mathrm{Val}(\pi, e, L) = \mathbb{E}_{x \sim e(\pi)}\, \underline{L}(\bar{e}_\pi(x)),$$
where $e(\pi)$ is the marginal distribution of the data and $\underline{L}(P) = \inf_{a \in A} \mathbb{E}_{\theta \sim P} L(\theta, a)$. Hence $\underline{L}$ is concave. This result allows us to parametrize the action set by $P(\Theta)$, by taking
$$L'(\theta, P) = L(\theta, a_P), \quad a_P \in \operatorname*{arg\,min}_{a \in A} \mathbb{E}_{\theta \sim P} L(\theta, a),$$
effectively properising the loss function. $L'$ is a proper loss (Reid and Williamson, 2011; Dawid, 2007; Grünwald and Dawid, 2004; Parry et al., 2012). There are deep connections between $\underline{L}$ and $L'$; we review some of these in section 2 of the appendix.
Connections to other Information Measures.
There are many connections between $\mathrm{Val}$ and different information measures present in the literature.
Definition For a convex function $f$, the $f$-information of a set of distributions $\{P_1, \dots, P_n\}$ is
$$I_f(P_1, \dots, P_n) = \mathbb{E}_{x \sim \mu}\, f\!\left(\frac{dP_1}{d\mu}(x), \dots, \frac{dP_n}{d\mu}(x)\right),$$
where $\mu$ is a reference distribution dominating the $P_i$.

$f$-informations are a multi-distribution extension of $f$-divergences and are used in certain generalizations of rate-distortion theory, where they produce better bounds than the standard techniques (Ziv and Zakai, 1973; Zakai and Ziv, 1975; Reid and Williamson, 2011; Garcia-Garcia and Williamson, 2012). For suitable choices of $f$ one can recover well known measures of information such as mutual information.
Theorem For all experiments $e$, loss functions $L$ and priors $\pi$, the gap between the value of $e$ and the value of the least informative experiment is an $f$-information for suitable $f$.
We direct the reader to Reid and Williamson (2011) and Garcia-Garcia and Williamson (2012). By the bijections presented in these two papers, one can replace $\mathrm{Val}$ by these divergences in all that follows. In particular, any $\mathrm{Val}$ can be replaced with an $I_f$ for suitable $f$, with no effect on the results. The reader is directed to section 3 of the appendix for the proof.
Here we present some examples of familiar learning problems phrased in this more abstract language. Normally there is a distinction between a learning algorithm, something that takes a data set of instance-label pairs and produces a classifier, and a decision rule that is the learnt classifier. Here we do not make such a distinction: both learning algorithms and decision rules produce actions, hence we only use the term decision rule.
Example 1 (Classification)
$\Theta = \{-1, 1\}$, and a Markov kernel $e \in \mathcal{M}(\Theta, X)$ is then a pair of distributions $P_{-1}$, $P_1$ on $X$. Normally $A = \Theta$, and a decision rule picks the corresponding label for a given observed $x \in X$. Different losses could be used, e.g. the 0-1 loss $L_{01}(\theta, a) = [\![\theta \neq a]\!]$.
Example 2 (Supervised Learning)
There is a space of labels $Y$ and a space of covariates $Z$, with $X = (Z \times Y)^n$. $\Theta$ is the space of joint distributions on $Z \times Y$, with $e$ the map that sends each distribution to its $n$-fold product. $A$ is then some set of classifiers (e.g. linear hyperplanes, kernel machines and so on). A decision rule then produces a classifier $a: Z \to Y$ from $n$ pairs $(z_i, y_i)$. For example, empirical risk minimization algorithms pick the classifier that minimizes the empirical loss on the observed training set. Many suitable losses exist, but normally $L(\theta, a)$ is the misclassification probability of the classifier $a$ when used against the distribution $\theta$.
Example 3 (Generalized Supervised Learning)
$\Theta$ and $e$ are as in supervised learning, but with $X = (Z \times Y)^n \times Z^m$: we observe $n$ instance-label pairs and $m$ instances. $e$ then maps each joint distribution to a product of $n$ copies of itself and $m$ copies of its marginal distribution over instances.
Example 4 (Active Learning)
$\Theta$ could be anything, with $X = S^n$, the length $n$ sequences in some set $S$. Each active learning policy determines a different $e$.
3 Feature Learning
Starting from a learning problem $(e, L)$ with $e \in \mathcal{M}(\Theta, X)$, feature learning methods aim to extract features in some space $Z$ via a feature map $f \in \mathcal{M}(X, Z)$. One then bases all decisions on these features. These methods swap the original learning problem $(e, L)$ for $(f \circ e, L)$. Normally the space $Z$ is smaller or of lower dimension than $X$, and $f$ aims at presenting a “compressed” view of the information contained in $X$. Features can be used for several reasons, including communication/storage constraints, increased performance (i.e. by implementing decision rules based on $Z$ rather than $X$ directly), knowledge discovery, and to avoid “curse of dimensionality” problems.
3.1 Supervised Feature Learning
There has been much attention in the machine learning literature on supervised feature learning techniques, where $e$, $L$ and the prior $\pi$ are fixed. These methods construct features by minimizing the feature gap
$$\mathrm{Val}(\pi, f \circ e, L) - \mathrm{Val}(\pi, e, L).$$
There is now a general framework for solving such problems, based largely on variations of the Blahut-Arimoto algorithm from rate distortion theory (Banerjee et al., 2005; Tishby et al., 1999; Cover and Thomas, 2012). For particular choices of $e$ and $L$ these methods reproduce many clustering methods such as k-means. We review these methods in section 4 of the appendix. These feature learning methods are not general enough for our purposes, as they rely on both the experiment and the loss. For example, if we wish to learn from data given by instance-label pairs, then we are required to learn the features after we have learnt the experiment $e$. Ideally we would like to learn a feature map $f$ independently of $e$, so that learning $(f \circ e, L)$ is just as beneficial as learning $(e, L)$ no matter what $e$ is.
3.2 Generic Feature Learning
For many learning problems, a large amount of unlabelled or loosely labelled data is readily available. For example, with any problem involving images, one only has to enter some basic search queries into Google to be presented with millions of instances. One of the main arguments of the deep learning community is that while this data may not be of direct use in learning classifiers, it can be of great use in learning feature representations. There exist many methods in the literature to learn features from unlabelled data.
In line with these methods we consider the following relaxation of the supervised feature learning problem. We assume that $e$, $L$ and the prior $\pi$ are allowed to vary, leaving only $X$ fixed, with one restriction: we assume that there is enough unlabelled data collected from the marginal distribution on $X$ that we are able to form an accurate estimate of it, $\mu \in P(X)$. We consider all learning problems and priors that are consistent with this information about $X$, i.e. with $e(\pi) = \mu$. We then seek to find features $f$ so that the value of $(f \circ e, L)$ is as close to that of $(e, L)$ as possible, no matter what $e$, $L$ and $\pi$ are. To ensure that minor differences in value are not exploited by multiplying the loss function by a large constant, we penalize the size of the loss function by $\lVert L \rVert_\infty$.
Definition 3 (Generic Features)
Fix a measure space $X$ and a distribution $\mu \in P(X)$. $(Z, f)$ are generic features of quality $\epsilon$ for $(X, \mu)$ if for all learning problems $(e, L)$ and priors $\pi$ with $e(\pi) = \mu$ we have
$$\mathrm{Val}(\pi, f \circ e, L) - \mathrm{Val}(\pi, e, L) \leq \epsilon \lVert L \rVert_\infty.$$
Ideally we want to make $\epsilon$ as small as possible; if $\epsilon$ is $0$ then our features never degrade the attainable value.
For our more relaxed problem, the value of our features is effectively determined by the supremum
$$\sup_{(e, \pi):\, e(\pi) = \mu}\; \sup_{L}\; \frac{\mathrm{Val}(\pi, f \circ e, L) - \mathrm{Val}(\pi, e, L)}{\lVert L \rVert_\infty}.$$
Luckily, supremums like these have been tackled in theoretical statistics, particularly in the work of Lucien Le Cam (Lecam, 2011, 1964, 1974). In his 1964 paper, Le Cam coined the deficiency distance as an extension of David Blackwell’s ordering of experiments (Blackwell, 1951, 1953) and as a means to provide an approximate version of the statistical notion of sufficiency. This quantity was used in his later work to form a metric not just on probability distributions but on experiments, and in particular allows one to calculate supremums over all loss functions and priors. We introduce this quantity (the deficiency) in the next section.
4 Approximate Factoring Through and Deficiency
Suppose $e_1 \in \mathcal{M}(\Theta, X_1)$ and $e_2 \in \mathcal{M}(\Theta, X_2)$ are Markov kernels where $e_2$ does not factor through $e_1$. We measure the degree to which $e_2$ fails to factor through $e_1$ by the weighted directed deficiency (Torgersen, 1991),
$$\delta_\pi(e_1, e_2) = \inf_{T \in \mathcal{M}(X_1, X_2)} \mathbb{E}_{\theta \sim \pi}\, V(T \circ e_1(\theta), e_2(\theta)).$$
$V(P, Q) = \sum_{x} |P(x) - Q(x)|$ is the variational divergence between the distributions $P$ and $Q$, a standard metric on probability distributions (see section 5 of the appendix for its properties). Calculating weighted directed deficiencies is a convex (in fact linear) optimization problem. One has that $T \circ e_1$ is linear in $T$; since the variational divergence is convex in its arguments, the objective is also convex in $T$, being the composition of a linear function and a convex function. Hence determining weighted directed deficiencies is a convex minimization problem. Fast methods exist for solving this problem (e.g. the well known simplex method of linear programming).
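The linear program can be written down explicitly: introduce one slack variable per pair $(\theta, x_2)$ to linearize the absolute values, and constrain the columns of $T$ to sum to one. The sketch below (function name illustrative, scipy assumed available) does exactly this:

```python
# A sketch (not the authors' code) computing the weighted directed
# deficiency delta_pi(e1, e2) = inf_T sum_theta pi(theta) V(T e1(theta), e2(theta))
# as a linear program, where V is the variational (L1) divergence.
import numpy as np
from scipy.optimize import linprog

def weighted_directed_deficiency(e1, e2, pi):
    n1, n2, nt = e1.shape[0], e2.shape[0], len(pi)
    nT = n2 * n1                   # T[j, i] stored at flat index j * n1 + i
    nS = nt * n2                   # slack s[t, j] >= |(T e1(t) - e2(t))_j|
    c = np.concatenate([np.zeros(nT), np.repeat(pi, n2)])
    A_ub, b_ub = [], []
    for t in range(nt):            # |(T e1(t) - e2(t))_j| <= s[t, j]
        for j in range(n2):
            for sign in (1.0, -1.0):
                row = np.zeros(nT + nS)
                row[j * n1:(j + 1) * n1] = sign * e1[:, t]
                row[nT + t * n2 + j] = -1.0
                A_ub.append(row)
                b_ub.append(sign * e2[j, t])
    A_eq = np.zeros((n1, nT + nS))  # each column of T sums to one
    for i in range(n1):
        A_eq[i, i:nT:n1] = 1.0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=np.ones(n1))   # default bounds keep T, s >= 0
    return float(res.fun)

e1 = np.array([[0.9, 0.1],
               [0.1, 0.9]])
T0 = np.array([[0.8, 0.2],
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])

d_exact = weighted_directed_deficiency(e1, T0 @ e1, pi)  # e2 factors exactly
d_gap = weighted_directed_deficiency(np.full((2, 2), 0.5), np.eye(2), pi)
```

On the two illustrative cases, the deficiency vanishes when $e_2$ factors exactly through $e_1$, and equals $1$ when $e_1$ is completely uninformative and $e_2$ is the identity experiment.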
Taking a supremum over the prior yields the directed deficiency,
$$\delta(e_1, e_2) = \sup_{\pi} \delta_\pi(e_1, e_2) = \inf_{T \in \mathcal{M}(X_1, X_2)} \sup_{\theta} V(T \circ e_1(\theta), e_2(\theta)),$$
where the second equality follows from the minimax theorem (Komiya, 1988). For the sake of checking whether $e_2 \trianglelefteq e_1$, it suffices to use the weighted directed deficiency with a prior that does not put zero probability on any $\theta$. In this case $e_2 \trianglelefteq e_1$ if and only if $\delta_\pi(e_1, e_2) = 0$ (Torgersen, 1991). The weighted deficiency and deficiency are respectively
$$\Delta_\pi(e_1, e_2) = \max\{\delta_\pi(e_1, e_2), \delta_\pi(e_2, e_1)\} \quad \text{and} \quad \Delta(e_1, e_2) = \max\{\delta(e_1, e_2), \delta(e_2, e_1)\}.$$
$\Delta(e_1, e_2) = 0$ if and only if $e_2 \trianglelefteq e_1$ and $e_1 \trianglelefteq e_2$, i.e. when $e_1$ is isomorphic to $e_2$, written $e_1 \sim e_2$. The deficiency distance $\Delta$ is a true metric on the space of experiments (modulo isomorphic experiments). A proof of this is included in the appendix.
4.1 Relation to Risk
Factoring through and approximate factoring through are deeply related to the worst case difference in performance between two learning problems with the same $\Theta$ as the loss is varied. Here we state three theorems that highlight the connections between factoring through and risk (Torgersen, 1991). Fix $\Theta$ and two experiments $e_1 \in \mathcal{M}(\Theta, X_1)$ and $e_2 \in \mathcal{M}(\Theta, X_2)$.
Theorem (Information Processing) If $e_2 \trianglelefteq e_1$ then for any loss function $L$ and prior $\pi$, $\mathrm{Val}(\pi, e_1, L) \leq \mathrm{Val}(\pi, e_2, L)$.

In particular, the information processing theorem implies that $\mathrm{Val}(\pi, e, L) \leq \mathrm{Val}(\pi, f \circ e, L)$ for any feature map $f$.
Theorem (Blackwell-Sherman-Stein) $e_2 \trianglelefteq e_1$ if and only if $\mathrm{Val}(\pi, e_1, L) \leq \mathrm{Val}(\pi, e_2, L)$ for all loss functions $L$ and priors $\pi$.
Theorem (Randomization) Fix $\epsilon \geq 0$ and a prior $\pi$. $\delta_\pi(e_1, e_2) \leq \epsilon$ if and only if $\mathrm{Val}(\pi, e_2, L) \leq \mathrm{Val}(\pi, e_1, L) + \epsilon \lVert L \rVert_\infty$ for all loss functions $L$.
These three theorems allow one to move between decision theoretic notions such as risk and value to probability theoretic notions such as factoring through. For example the original definition of sufficiency can be interpreted in terms of factoring through.
Theorem Fix an experiment $e \in \mathcal{M}(\Theta, X)$ and a function $f: X \to Z$. $f$ is a sufficient statistic for $e$ if and only if $e \trianglelefteq f \circ e$.
By the Blackwell-Sherman-Stein theorem we have an equivalent condition for sufficiency in terms of value.
Theorem Fix an experiment $e \in \mathcal{M}(\Theta, X)$ and a function $f: X \to Z$. Then $f$ is a sufficient statistic for $e$ if and only if $\mathrm{Val}(\pi, f \circ e, L) = \mathrm{Val}(\pi, e, L)$ for all $L$ and $\pi$.
Isomorphic experiments always have the same value, no matter what the loss function or the set of actions. Approximately isomorphic experiments, ones where $\Delta(e_1, e_2)$ is small, always have approximately the same value. Due to the similarities between learning features and sufficient statistics, it should be no surprise that tools for working with approximate sufficiency appear in feature learning.
The Randomization theorem is an example of how an approximate notion in probability theory (here approximate sufficiency) has a dual approximate notion in terms of risk. It can be strengthened to an equality.

Theorem 2 For all experiments $e_1$, $e_2$ and all priors $\pi$,
$$\delta_\pi(e_1, e_2) = \sup_{L} \frac{\mathrm{Val}(\pi, e_2, L) - \mathrm{Val}(\pi, e_1, L)}{\lVert L \rVert_\infty}.$$
This result is an improvement and generalization of a result contained in Liese (2012) that applied only to binary experiments and held only with inequality. The proof is included in the appendix. We utilize this theorem and the Randomization theorem heavily in the following section.
4.2 Reductions via Factoring Through
Approximate factoring through can be used to transform decision rules for one learning problem into rules for another, with a provable bound on the performance of the transformed rule. Suppose $r \in \mathcal{M}(X_1, A)$ is a decision rule used for the learning problem $(e_1, L)$. For another learning problem $(e_2, L)$, we can construct a decision rule $r \circ T$ from a Markov kernel $T \in \mathcal{M}(X_2, X_1)$. Furthermore, if $\mathbb{E}_{\theta \sim \pi}\, V(T \circ e_2(\theta), e_1(\theta)) \leq \epsilon$ then
$$\mathrm{Risk}(\pi, e_2, r \circ T, L) \leq \mathrm{Risk}(\pi, e_1, r, L) + \epsilon \lVert L \rVert_\infty.$$
By taking infimums over $T$ we obtain the smallest $\epsilon$, namely $\delta_\pi(e_2, e_1)$, in the above.
5 Analysis of Generic Feature Learning via Deficiency
Assume one has enough data in some measure space $X$ to form a good estimate of the marginal distribution $\mu \in P(X)$. We wish to construct generic features $(Z, f)$ for $(X, \mu)$. By theorem 2, this is equivalent to finding an $f$ that minimizes $\sup \delta_\pi(f \circ e, e)$, the supremum taken over all pairs $(e, \pi)$ with $e(\pi) = \mu$. One might imagine that this means finding for each $f$ the worst $(e, \pi)$, but this is not the case.
Theorem For all measure spaces $Z$ and all feature maps $f \in \mathcal{M}(X, Z)$,
$$\sup_{(e, \pi):\, e(\pi) = \mu} \delta_\pi(f \circ e, e) = \delta_\mu(f, \mathbb{1}_X).$$
No matter which feature map we use, the worst learning problem we can pit against it is the one that asks us to reconstruct $x$ directly from the features. The proof is straightforward and is included in the appendix. It hinges on the representation of $\mathrm{Val}$ in terms of average posterior Bayes risk and the Randomization theorem. By definition,
$$\delta_\mu(f, \mathbb{1}_X) = \inf_{g \in \mathcal{M}(Z, X)} \mathbb{E}_{x \sim \mu}\, V(g \circ f(x), \delta_x) \quad (1)$$
$$= \inf_{g \in \mathcal{M}(Z, X)} \mathbb{E}_{x \sim \mu}\, 2\left(1 - (g \circ f)(x \mid x)\right), \quad (2)$$
where from lines (1) to (2) we have used one of the equivalent forms of the variational divergence listed in the appendix. Hence $\delta_\mu(f, \mathbb{1}_X)$ is equal to twice the minimal possible average reconstruction error obtainable from the encoder $f$ under the prior $\mu$. This means that finding the best generic features for the data involves finding the $f$ that gives this learning problem the lowest value. We term this problem the reconstruction problem.
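For a fixed deterministic encoder the infimum over decoders is attained by MAP reconstruction: for each feature value $z$, output the $\mu$-most-probable $x$ in $f^{-1}(z)$. A sketch with invented numbers:

```python
# A sketch of the reconstruction problem for a fixed deterministic encoder
# f: X -> Z. The optimal decoder is MAP reconstruction, and the deficiency
# from the identity is twice the resulting reconstruction error.
import numpy as np

mu = np.array([0.4, 0.3, 0.2, 0.1])   # marginal over X = {0, 1, 2, 3}
f = np.array([0, 0, 1, 1])            # encoder collapsing X onto Z = {0, 1}

# MAP decoder: for each z, output the most probable x in f^{-1}(z).
recon_error = 1.0 - sum(mu[f == z].max() for z in np.unique(f))
deficiency_from_identity = 2.0 * recon_error
```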
Theorem 3 Fix a prior $\mu \in P(X)$ and an $\epsilon \geq 0$. $(Z, f)$ constitutes generic features of quality $\epsilon$ for $(X, \mu)$ if and only if, for the reconstruction problem, one can find a decision rule (decoder) $g \in \mathcal{M}(Z, X)$ with
$$\mathbb{E}_{x \sim \mu}\left(1 - (g \circ f)(x \mid x)\right) \leq \frac{\epsilon}{2},$$
i.e. with probability of reconstruction error at most $\epsilon / 2$.
For a proof see the appendix. If we optimize over both the encoder $f$ and the decoder $g$, finding
$$\inf_{f \in \mathcal{M}(X, Z)}\, \inf_{g \in \mathcal{M}(Z, X)}\, \mathbb{E}_{x \sim \mu}\left(1 - (g \circ f)(x \mid x)\right),$$
we obtain a variant of the popular autoencoder algorithm from deep learning (Vincent et al., 2008).
Of course one can always take $Z = X$ and $f = \mathbb{1}_X$, in which case no real feature learning is done and no performance is lost. However, it is more instructive to set $Z$ to a measure space of smaller size or dimension than $X$, so that the feature learning extracts (provably) useful patterns in $X$. Performing the joint minimization over $f$ and $g$ is a non-convex problem.
5.1 Relation to other Feature Learning Methods
Infomax. When faced with a choice of feature maps, the Infomax principle (Bell and Sejnowski, 1995; Linsker, 1989) dictates that one should choose the features that minimize the conditional entropy of the data given the features, $H(X \mid Z)$. By the Hellman-Raviv inequality from information theory (Hellman and Raviv, 1970) we have
$$\inf_{g \in \mathcal{M}(Z, X)} \mathbb{E}_{x \sim \mu}\left(1 - (g \circ f)(x \mid x)\right) \leq \frac{1}{2} H(X \mid Z),$$
meaning the Infomax principle is minimizing an upper bound of the probability of reconstruction error.
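A quick numerical check of this bound on a toy example (deterministic encoder, entropy in bits; all numbers invented):

```python
# Checking the Hellman-Raviv bound numerically: the minimal reconstruction
# error is at most half the conditional entropy H(X|Z), in bits.
import numpy as np

mu = np.array([0.4, 0.3, 0.2, 0.1])   # marginal over X
f = np.array([0, 0, 1, 1])            # deterministic encoder X -> Z

# Minimal (MAP-decoder) probability of reconstruction error.
recon_error = 1.0 - sum(mu[f == z].max() for z in np.unique(f))

h_cond = 0.0                          # H(X|Z) = sum_z P(z) H(X | Z = z)
for z in np.unique(f):
    pz = mu[f == z].sum()
    cond = mu[f == z] / pz            # P(x | z), all entries positive here
    h_cond += pz * -(cond * np.log2(cond)).sum()
```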
Manifold Learning. Under the assumption that $\mu$ has support on some manifold, manifold learning methods aim to extract this manifold and provide a parametrization (Silva and Tenenbaum, 2002; Belkin and Niyogi, 2003). If we are able to learn this manifold then any coordinate system for it would constitute generic features.
Sparse Coding. Much like manifold learning, sparse coding also attempts to find lower dimensional structure in $\mu$ (Lee et al., 2006; Olshausen and Field, 1997). Here $Z$ is chosen to have higher dimension than $X$; however, the image of the feature map should comprise only sparse vectors, those with few non-zero entries. If $f$ is injective on the support of $\mu$ then $(Z, f)$ are generic features.
5.2 Learning Feature Hierarchies
One of the tenets of the deep learning paradigm is that features should be learnt in a hierarchical fashion. One should first find patterns in $X$ through a feature map $f_1$, then find patterns in $Z_1$, and so on. We construct a chain of feature maps
$$X \xrightarrow{f_1} Z_1 \xrightarrow{f_2} Z_2 \xrightarrow{f_3} \cdots \xrightarrow{f_k} Z_k,$$
with final feature space $Z_k$ and final feature map given by the composition $f = f_k \circ \cdots \circ f_1$ of all maps in the chain. Proceeding in this fashion allows greater control over the feature spaces $Z_i$. For example, the first $Z_1$ could be of similar (but still lower) size than $X$, perhaps before a big drop off in the middle of the chain. If we can learn each of the $f_i$ iteratively, it also makes searching for features easier. For example, perhaps the first three feature maps have low probability of reconstruction error but not the fourth. In this situation we at least still have a good feature map given by the composition of the first three mappings. We have, for any chain of feature maps,
$$\delta_\mu(f_k \circ \cdots \circ f_1, \mathbb{1}_X) \leq \sum_{i=1}^{k} \delta_{\mu_{i-1}}(f_i, \mathbb{1}_{Z_{i-1}}),$$
where $Z_0 = X$, $\mu_0 = \mu$, and $\mu_i = f_i(\mu_{i-1})$ is the marginal pushed forward onto $Z_i$. This follows as the weighted deficiency satisfies a triangle inequality, together with repeated application of theorem 2. The reconstruction error of the entire system is bounded by the sum of the reconstruction errors at each step of the chain. This means we can learn a feature mapping iteratively, by first learning patterns in $X$, then in $Z_1$, and so on. This is exactly the process that occurs in a Deep Belief Network (Hinton and Salakhutdinov, 2006).
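The stacking bound is easy to verify numerically for deterministic encoder chains: measure each layer's reconstruction error against the pushed-forward marginal and compare with the error of the composed map (illustrative data and helper names):

```python
# A sketch checking the stacking bound: the reconstruction error of a
# composed encoder chain is at most the sum of the per-layer reconstruction
# errors, each measured against the pushed-forward marginal.
import numpy as np

def recon_error(mu, f):
    """Minimal (MAP-decoder) reconstruction error of encoder f under mu."""
    return 1.0 - sum(mu[f == z].max() for z in np.unique(f))

def pushforward(mu, f, n_out):
    """Marginal on the feature space induced by the deterministic map f."""
    return np.array([mu[f == z].sum() for z in range(n_out)])

mu = np.array([0.4, 0.3, 0.2, 0.1])
f1 = np.array([0, 0, 1, 2])            # layer 1: X -> Z1
f2 = np.array([0, 1, 1])               # layer 2: Z1 -> Z2

err1 = recon_error(mu, f1)
err2 = recon_error(pushforward(mu, f1, 3), f2)
err_composed = recon_error(mu, f2[f1])  # the chain f2∘f1 : X -> Z2
```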
5.3 Supervised Feature learning can work when Generic Feature Learning Fails
We present two examples where one cannot learn generic features, but one can learn experiment or loss specific features.
Experiment Specific Features. Let $\Theta = \mathbb{R}$ and $X = \mathbb{R}^n$, with $e$ given by the product of $n$ normal distributions with mean $\theta$ and fixed variance. It is easy to verify that the sample mean is a sufficient statistic, meaning that at least for this experiment we can greatly compress the information contained in $X$. However, if we take as a prior for $\theta$ a normal distribution, then the marginal distribution $\mu$ will not be concentrated on a set of smaller dimension nor have any particularly interesting structure. Hence we cannot find interesting generic features in this case.
Experiment and Loss Specific Features. Let $\Theta = \{-1, 1\}$, with $e(\theta)$ a normal distribution centred on $\theta$ as in the figure below. For this experiment, the 0-1 loss $L_{01}$ and a uniform prior $\pi$, the best decision is $a = 1$ if $x > 0$ and $a = -1$ otherwise. It is easy to show that $\mathrm{Val}(\pi, \mathrm{sign} \circ e, L_{01}) = \mathrm{Val}(\pi, e, L_{01})$: all we need is the output of $\mathrm{sign}(x)$. However, if we change the loss to a cost sensitive loss, where misclassifying one class is more costly than the other, we no longer have this equality.
5.4 Alternate Reconstruction Problems
The reconstruction problem that is required to be solved to construct generic features is the most difficult one we can pose. To perform well in this problem we are required to reconstruct each $x$ exactly. This discards other interesting structure the set $X$ may have. For example, if $X$ is image data, a different loss function, perhaps one elicited from psychological tests of what humans perceive to be different images, is more appropriate. While these are valid points, we remind the reader that this is a first step in understanding these methods, and making any extra assumptions about $X$ and its structure is exactly what we are trying to avoid. However, the Hellman-Raviv inequality does give a means of bounding the value of the reconstruction problem with the values of different reconstruction problems.
6 Concluding Remarks and Future Work
We have defined generic features and have provided a characterization of when it is possible to learn them. In doing so we have illuminated some popular feature learning methods including autoencoders, deep belief networks and the Infomax principle. We have moved from supervised feature learning methods, where $e$, $L$ and $\pi$ are all fixed, to generic feature learning, with almost nothing fixed. The supremum defining generic features shows how difficult and general the problem of finding generic features is. It is reasonable to argue that in practice one does not require features that work for all experiments, all losses and all priors, which by existing results implies a quantification over all $f$-informations. This begs the question of which experiments, loss functions and priors to consider. We might not require all the information in $X$ to be maintained, just enough to suit our purposes. This is analogous to the problem of formalizing the notion of how much information is contained in an experiment. As argued long ago by Morris DeGroot (DeGroot, 1962), even if one is doing an experiment to “gain information”, eventually one does something with this “information” by choosing how to act, the consequence of which will be measured by some loss. Hence a more general theory of feature learning needs to be able to control the sensitivity to the loss, allowing one to move from supervised feature learning to generic feature learning.
A starting point would be to take $e$ fixed and $\pi$ lying in some subset of $P(\Theta)$, as occurs in robust statistics (Huber, 2011), or to allow small perturbations in the loss. Deficiency can possibly play a role in the development of algorithms to learn features when we take these restricted supremums. There is scope to develop new quantities and theorems analogous to those for deficiency where, instead of a supremum over all losses, one takes a supremum over some restricted subset. At present this is an open and uncharted area of both machine learning and theoretical statistics.
7.1 More Properties of Factoring Through
Lemma Factoring through has the following properties. For all sets and Markov kernels $e_1$, $e_2$ and $e_3$ we have: (1) factoring through is transitive; (2) all kernels factor through the identity; (3) the completely uninformative kernel factors through every kernel.

For (1), if $e_3 = T_2 \circ e_2$ and $e_2 = T_1 \circ e_1$ then we have $e_3 = (T_2 \circ T_1) \circ e_1$, and $T_2 \circ T_1$ is obviously a Markov kernel. For (2) note that for any Markov kernel $e_1$ one has $e_1 = e_1 \circ \mathbb{1}_\Theta$. Hence all kernels factor through the identity. For (3) take any $e_1$ and recall that the completely uninformative kernel is induced by the map to a one element set; composing the corresponding kernel with $e_1$ yields the completely uninformative kernel. Note that the kernels from a one element set comprise the constant Markov kernels, ones that map each input to the same distribution.
One can view factoring through as the addition of noise. By showing that $e_2 = T \circ e_1$, we are showing that $e_2$ is $e_1$ composed with extra noise $T$.
For some intuition on what factoring through looks like, below is a plot of four binary experiments.
We have that:
The second factors through the first.
The third does not factor through the first (nor the first through the third).
The fourth factors through the first and vice versa (it is just a shifted version of the first).
Suppose $T_1$ and $T_2$ are two Markov kernels. If $T_1 \trianglelefteq T_2$ and $T_2 \trianglelefteq T_1$ then we say that $T_1$ is isomorphic to $T_2$, written $T_1 \sim T_2$. Isomorphic Markov kernels can appear quite different. For example, suppose that $e(\theta)$ is an exponential family distribution ($\Theta$ in this case being the parameters of the family). If $s$ is the sufficient statistic for the family then $e \sim T_s \circ e$, even though the two appear quite different and it appears $s$ may throw away lots of information. From a statistical point of view, isomorphic Markov kernels are the same.
7.2 $f$-Information and Value
Theorem For all experiments $e$, loss functions $L$ and priors $\pi$, the gap between the value of $e$ and the value of the least informative experiment is an $f$-information for suitable $f$.
where is concave. Let , , and . Then and
We need to show that is concave. Note that where
prepends to and and multiplies by the prior, and
As is affine, is concave if is. For all and
where we have used the concavity of . Therefore is concave. Finally
with convex. This completes the proof.
7.3 Proper Loss Functions
Definition A loss function $L: \Theta \times P(\Theta) \to \mathbb{R}$ is proper if for all $P, Q \in P(\Theta)$,
$$\mathbb{E}_{\theta \sim P}\, L(\theta, P) \leq \mathbb{E}_{\theta \sim P}\, L(\theta, Q).$$
Any loss function can be properized.
Theorem Let $L: \Theta \times A \to \mathbb{R}$ be a loss. Define
$$L'(\theta, P) = L(\theta, a_P), \quad a_P \in \operatorname*{arg\,min}_{a \in A} \mathbb{E}_{\theta \sim P} L(\theta, a),$$
where we arbitrarily pick an $a_P$ if there are multiple minimizers. Then $L'$ is proper.
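A sketch of this properisation for a finite loss matrix, with a brute-force numerical check of propriety over a grid of binary distributions (loss values invented):

```python
# Properising a loss: L'(theta, P) = L(theta, a_P), where a_P minimises
# the expected loss under P. The numbers below are illustrative.
import numpy as np

L = np.array([[0.0, 2.0, 0.5],     # loss matrix L[theta, a], |Theta| = 2
              [1.0, 0.0, 0.5]])

def best_action(P):
    """a_P: an action minimising E_{theta ~ P} L(theta, a) (first minimiser)."""
    return int(np.argmin(P @ L))

def properised_loss(theta, P):
    return L[theta, best_action(P)]

def expected(P, Q):
    """E_{theta ~ P} L'(theta, Q): play the Q-optimal action against P."""
    return float(sum(P[t] * properised_loss(t, Q) for t in range(2)))

# Propriety check: predicting P is never worse than predicting any Q.
ps = [np.array([t, 1.0 - t]) for t in np.linspace(0.0, 1.0, 21)]
proper = all(expected(P, P) <= expected(P, Q) + 1e-12 for P in ps for Q in ps)
```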
It is possible that by using this trick we remove actions from consideration. However, for the purpose of calculating Bayes risks we do not require these actions. From $L'$, one can define a regret
$$\Delta L(P, Q) = \mathbb{E}_{\theta \sim P}\, L'(\theta, Q) - \mathbb{E}_{\theta \sim P}\, L'(\theta, P),$$
which measures how suboptimal the best action against the distribution $Q$ is when played against the distribution $P$. One does not need knowledge of the loss function to construct $\Delta L$; rather, one only needs knowledge of the Bayes risk
$$\underline{L}(P) = \inf_{a \in A} \mathbb{E}_{\theta \sim P}\, L(\theta, a).$$
From this one can reconstruct $L'$, and hence $\Delta L$, for the purposes of calculating Bayes risks. This is achieved by taking the 1-homogeneous extension of $\underline{L}$.
Theorem For a concave $\phi: P(\Theta) \to \mathbb{R}$,
$$L_\phi(\theta, P) = \phi(P) + \langle \phi'(P), \delta_\theta - P \rangle,$$
where $\phi'(P)$ is a supergradient of $\phi$ at $P$, is a proper loss.
The regret from a proper loss is equal to the Bregman divergence defined by $\phi$,
$$\Delta L_\phi(P, Q) = \phi(Q) + \langle \phi'(Q), P - Q \rangle - \phi(P).$$
Theorem For all concave $\phi$ and all $P \in P(\Theta)$, $\underline{L_\phi}(P) = \phi(P)$.

All of these properties show that we only need knowledge of $\underline{L}$ to compute Bayes risks and values.
7.4 Supervised Feature Learning
For a given experiment $e$, prior $\pi$ and loss function $L$, supervised feature learning methods aim to minimize the feature gap
$$\mathrm{Val}(\pi, f \circ e, L) - \mathrm{Val}(\pi, e, L).$$
The feature gap satisfies
$$\mathrm{Val}(\pi, f \circ e, L) - \mathrm{Val}(\pi, e, L) = \mathbb{E}_{x \sim e(\pi)}\, \Delta L\!\left(\bar{e}_\pi(x),\, \overline{(f \circ e)}_\pi(f(x))\right),$$
where $\Delta L$ is the regret induced by $L$.
For the proof we use the more familiar probability theory notation, with $P(x \mid \theta)$, $P(\theta)$ and so on. One has