Generalization in Machine Learning via Analytical Learning Theory

02/21/2018 ∙ by Kenji Kawaguchi, et al. ∙ MIT, Université de Montréal

This paper introduces a novel measure-theoretic learning theory to analyze generalization behaviors of practical interest. The proposed learning theory has the following abilities: 1) to utilize the qualities of each learned representation on the path from raw inputs to outputs in representation learning, 2) to guarantee good generalization errors possibly with arbitrarily rich hypothesis spaces (e.g., arbitrarily large capacity and Rademacher complexity) and non-stable/non-robust learning algorithms, and 3) to clearly distinguish individual problem instances from one another. Our generalization bounds are relative to a representation of the data, and hold true even if the representation is learned. We discuss several consequences of our results on deep learning, one-shot learning and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. Because of the differences in the assumptions and the objectives, the proposed learning theory is meant to be complementary to previous learning theory and is not designed to compete with it.

1 Introduction

Statistical learning theory provides tight and illuminating results under its assumptions and for its objectives (e.g., Vapnik 1998; Mukherjee et al. 2006; Mohri et al. 2012). Because training datasets are considered as random variables, statistical learning theory was initially more concerned with the study of data-independent bounds based on the capacity of the hypothesis space (Vapnik, 1998) or the classical stability of the learning algorithm (Bousquet and Elisseeff, 2002). Given the observation that these data-independent bounds can be overly pessimistic for a “good” training dataset, data-dependent bounds have also been developed in statistical learning theory, such as the luckiness framework (Shawe-Taylor et al., 1998; Herbrich and Williamson, 2002), the empirical Rademacher complexity of a hypothesis space (Koltchinskii and Panchenko, 2000; Bartlett et al., 2002), and the robustness of the learning algorithm (Xu and Mannor, 2012).

Along this line of reasoning, we notice that the previous bounds, including data dependent ones, can be pessimistic for a “good” problem instance, which is defined by a tuple of a true (unknown) measure, a training dataset and a learned model (see Section 3 for further details). Accordingly, this paper proposes a learning theory designed to be strongly dependent on each individual problem instance. To achieve this goal, we directly analyse the generalization gap (difference between expected error and training error) and datasets as non-statistical objects via measure theory. This is in contrast to the setting of statistical learning theory wherein these objects are treated as random variables.

The non-statistical nature of our proposed theory can be of practical interest on its own merits. For example, the non-statistical nature captures well a situation wherein a training dataset is specified and fixed first (e.g., a UCI dataset, ImageNet, a medical image dataset, etc.), rather than remaining random with a certain distribution. Once a dataset is actually specified, there is no randomness remaining over the dataset (although one can artificially create randomness via an empirical distribution). For example, Zhang et al. (2017) empirically observed that given a fixed (deterministic) dataset (i.e., each of CIFAR10, ImageNet, and MNIST), test errors can be small despite the large capacity of the hypothesis space and possible instability of the learning algorithm. Understanding and explaining this empirical observation has become an active research area (Arpit et al., 2017; Krueger et al., 2017; Hoffer et al., 2017; Wu et al., 2017; Dziugaite and Roy, 2017; Dinh et al., 2017; Bartlett et al., 2017; Brutzkus et al., 2017).

For convenience within this paper, the proposed theory is called analytical learning theory, owing to its non-statistical and analytical nature. While the scope of statistical learning theory covers both prior and posterior guarantees, analytical learning theory focuses on providing prior insights via posterior guarantees; i.e., the mathematical bounds are available before the learning is done, which provides insights a priori for understanding the phenomenon and designing algorithms, but the numerical value of the bounds depends on posterior quantities. A firm understanding of analytical learning theory requires a different style of thinking and a shift of technical basis from statistics (e.g., concentration inequalities) to measure theory. We present the foundation of analytical learning theory in Section 3 and several applications in Sections 4-5.

2 Preliminaries

In machine learning, a typical goal is to return a model via a learning algorithm, given a dataset, such that the expected error with respect to a true (unknown) normalized measure is minimized. Here, the integrand of the expected error is a function that combines a loss function and a model; e.g., in supervised learning, it is the loss incurred by the model's prediction for an input paired with its target. Because the expected error is often not computable, we usually approximate it by an empirical error, defined as the average loss over a dataset. One of the goals of learning theory is to explain and validate when and how minimizing the empirical error is a sensible approach to minimizing the expected error by analyzing the generalization gap, and to provide bounds on the performance of the learned model on new data.
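
To make this setup concrete, the following minimal sketch (in Python, with an illustrative model, loss, and data distribution that are not taken from this paper) approximates the expected error by a large Monte Carlo sample standing in for the true measure and contrasts it with the empirical error on a small dataset; their difference is the generalization gap studied below.

    import numpy as np

    # Toy supervised setting: inputs x ~ Uniform[0, 1], targets y = 2 x, squared loss.
    rng = np.random.default_rng(0)

    def model(x, w=1.8):                     # a fixed "learned" model f(x) = w * x
        return w * x

    def loss(y_pred, y_true):                # squared loss
        return (y_pred - y_true) ** 2

    def empirical_error(xs):                 # average loss of the model on a dataset
        return loss(model(xs), 2.0 * xs).mean()

    x_train = rng.uniform(size=20)           # a small training dataset
    x_large = rng.uniform(size=1_000_000)    # Monte Carlo proxy for the true measure

    train_err = empirical_error(x_train)
    expected_err = empirical_error(x_large)
    print("training error      :", train_err)
    print("expected error (MC) :", expected_err)
    print("generalization gap  :", expected_err - train_err)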

2.1 Discrepancy and variation

In the following, we define a quality of a dataset, called the discrepancy, and a quality of a function, called the variation in the sense of Hardy and Krause. These definitions have been used in harmonic analysis, number theory, and numerical analysis (Krause, 1903; Hardy, 1906; Hlawka, 1961; Niederreiter, 1978; Aistleitner et al., 2017). This study adopts these definitions in the context of machine learning. Intuitively, the star-discrepancy evaluates how well a dataset captures a normalized measure, and the variation in the sense of Hardy and Krause measures how much a function varies in total with respect to small perturbations of every cross-combination of its variables.

2.1.1 Discrepancy of dataset with respect to a measure

For any , let be a closed axis-parallel box with one vertex at the origin. The local discrepancy of a dataset with respect to a normalized Borel measure on a set is defined as

where is the indicator function of a set . Figure 1 in Appendix A.1 shows an illustration of the local discrepancy and related notation. The star-discrepancy of a dataset with respect to a normalized Borel measure is defined as
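
As a rough numerical illustration, the following sketch estimates the local discrepancy and a lower bound on the star-discrepancy of a dataset in the unit square, assuming the uniform (Lebesgue) measure so that the measure of each anchored box has a closed form; the supremum over all boxes is approximated by a finite random search, and all function names are illustrative.

    import numpy as np

    def local_discrepancy(points, x, mu_box):
        """|empirical mass of the box [0, x] minus mu([0, x])| for one anchor point x.

        points : (m, d) array of dataset points in [0, 1]^d
        x      : (d,) upper corner of the axis-parallel box [0, x]
        mu_box : callable giving the measure of [0, x] under the target measure
        """
        inside = np.all(points <= x, axis=1)          # indicator of the box [0, x]
        return abs(inside.mean() - mu_box(x))

    def star_discrepancy_estimate(points, mu_box, n_probe=20000, seed=0):
        """Lower estimate of the star-discrepancy: the maximum local discrepancy over
        randomly sampled candidate boxes (the true value is a supremum over all boxes)."""
        rng = np.random.default_rng(seed)
        probes = rng.uniform(size=(n_probe, points.shape[1]))
        return max(local_discrepancy(points, x, mu_box) for x in probes)

    rng = np.random.default_rng(1)
    data = rng.uniform(size=(200, 2))                 # a dataset of 200 points in [0, 1]^2
    lebesgue = lambda x: float(np.prod(x))            # uniform measure of the box [0, x]
    print("estimated star-discrepancy:", star_discrepancy_estimate(data, lebesgue))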

2.1.2 Variations of a function

Let be the partial derivative operator; that is, is the partial derivative of a function with respect to the -th coordinate at a point . Let . A partition of with size is a set of finite sequences () such that for . We define a difference operator with respect to a partition as: given a function and a point in the partition (for ),

where is the subsequent point in the partition along the coordinate . Let . Given a function of variables, let be the function restricted on variables such that , where for all . That is, is a function of with other original variables being fixed to be one.

The variation of on in the sense of Vitali is defined as

where is the set of all partitions of . The variation of on in the sense of Hardy and Krause is defined as

For example, if is linear on its domain, because for all . The following proposition might be helpful in intuitively understanding the concept of the variation as well as in computing it when applicable. All the proofs in this paper are presented in Appendix B.

Suppose that is a function for which exists on . Then,

If is also continuous on ,
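
For smooth functions, the characterization above suggests a simple numerical estimate: the following sketch approximates the variation in the sense of Vitali as the integral of the absolute highest-order mixed partial derivative, using Monte Carlo sampling and finite differences. The function names, sample sizes, and step size are illustrative, and the result is only a rough approximation.

    import itertools
    import numpy as np

    def vitali_variation_smooth(f, d, n_samples=50000, h=1e-3, seed=0):
        """Monte Carlo estimate of the Vitali variation of a smooth f on [0, 1]^d,
        i.e. the integral of the absolute d-th order mixed partial derivative,
        approximated here by a central finite difference."""
        rng = np.random.default_rng(seed)
        x = rng.uniform(h, 1 - h, size=(n_samples, d))
        mixed = np.zeros(n_samples)
        # alternating sum over the 2^d corners of a small box around each sample point
        for signs in itertools.product((-1, 1), repeat=d):
            mixed += np.prod(signs) * f(x + h * np.array(signs))
        mixed /= (2 * h) ** d
        return np.mean(np.abs(mixed))

    # f(x1, x2) = x1 * x2 has constant mixed partial 1, so its Vitali variation on
    # [0, 1]^2 is 1; a function with no cross interaction would give 0 instead.
    f = lambda x: x[:, 0] * x[:, 1]
    print(vitali_variation_smooth(f, d=2))            # approximately 1.0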

3 A basis of analytical learning theory

This study considers the problem of analyzing the generalization gap between the expected error and the training error. For the purpose of general applicability, our base theory analyzes a more general quantity: the generalization gap between the expected error and any empirical error with any dataset (of size ), including the training dataset with . Whenever we write , this always includes the case of ; i.e., the case where the model is evaluated on the training set.

With our notation, one can observe that the generalization gap is fully and deterministically specified by a problem instance , where we identify an omitted measure space by the measure for brevity. Indeed, the expected error is defined by the Lebesgue integral of a function on a (unknown) normalized measure space as , which is a deterministic mathematical object. Accordingly, we introduce the following notion of strong instance-dependence: a mathematical object is said to be strongly instance-dependent in the theory of the generalization gap of the tuple if the object is invariant under any change of any mathematical object that contains or depends on any , any , or any such that and . Analytical learning theory is designed to provide mathematical bounds and equations that are strongly instance-dependent.

3.1 Analytical decomposition of expected error

Let be any (unknown) normalized measure space that defines the expected error, . Here, the measure space may correspond to an input-target pair as for supervised learning, the generative hidden space of for unsupervised / generative models, or anything else of interest (e.g., ). Let be the pushforward measure of under a map . Let be the image of the dataset under . Let be the total variation of a measure on . For vectors , let , where denotes the product order; that is, if and only if for . This paper adopts the convention that the infimum of the empty set is positive infinity.

Theorem 3.1 is introduced below to exploit the various structures in machine learning through the decomposition where is the output of a representation function and outputs the associated loss. Here, can be any intermediate representation on the path from the raw data (when ) to the output (when ). The proposed theory holds true even if the representation is learned. The empirical error can be the training error with or the test/validation error with .

For any , let be a set of all pairs such that is a measurable function, is of bounded variation as , and

where indicates the Borel -algebra on . Then, for any dataset pair (including ) and any ,

  (i) where , and

  (ii) for any such that is right-continuous component-wise,

    where , and is a signed measure corresponding to as and .

The statements in Theorem 3.1 hold for each individual instance , for example, without taking a supremum over a set of other instances. In contrast, typically in previous bounds, when asserting that an upper bound holds on for any (with high probability), what it means is that the upper bound holds on (with high probability). Thus, in classical bounds, including data-dependent ones, as gets larger and more complex, the bounds tend to become more pessimistic for the actual instance (learned with the actual instance ), which is avoided in Theorem 3.1.

The bound and the equation in Theorem 3.1 are strongly instance-dependent and, in particular, invariant to the hypothesis space and to the properties of the learning algorithm over datasets different from the given training dataset (and ).
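
To make the shape of such a bound concrete, the following one-dimensional sketch checks a Koksma-type inequality of the same form as the bound in Theorem 3.1 (i): the gap between the expected and empirical averages of a function is bounded by its variation times the star-discrepancy of the dataset. It assumes the uniform measure on [0,1], for which the star-discrepancy has a closed-form expression; the function and dataset are illustrative only.

    import numpy as np

    def star_discrepancy_1d(t):
        """Exact star-discrepancy of points t in [0, 1] w.r.t. the uniform measure."""
        t = np.sort(t)
        m = len(t)
        i = np.arange(1, m + 1)
        return max(np.max(np.abs(i / m - t)), np.max(np.abs((i - 1) / m - t)))

    def total_variation(g):
        """Numerical total variation of g on [0, 1] (sum of |increments| on a fine grid)."""
        grid = np.linspace(0.0, 1.0, 100001)
        return np.sum(np.abs(np.diff(g(grid))))

    g = lambda x: x ** 2                            # a simple function of bounded variation
    t = np.random.default_rng(0).uniform(size=50)   # a dataset of m = 50 points

    gap = abs(1.0 / 3.0 - g(t).mean())              # E_mu[g] = 1/3 for g(x) = x^2
    bound = total_variation(g) * star_discrepancy_1d(t)
    print(f"gap = {gap:.4f} <= V(g) * D* = {bound:.4f}")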

Theorem 3.1 together with Remark 3.1 has an immediate practical consequence. For example, even if the true model is contained in some “small” hypothesis space , we might want to use a much more complex “larger” hypothesis space in practice such that the optimization becomes easier and the training trajectory reaches a better model at the end of the learning process (e.g., over-parameterization in deep learning potentially makes the non-convex optimization easier; see Dauphin et al. 2014; Choromanska et al. 2015; Soudry and Hoffer 2017). This is consistent with both Theorem 3.1 and practical observations in deep learning, although it can be puzzling from the viewpoint of previous results that explicitly or implicitly penalize the use of more complex “larger” hypothesis spaces (e.g., see Zhang et al. 2017).

Theorem 3.1 does not require statistical assumptions. Thus, it is applicable even when statistical assumptions required by statistical learning theory are violated in practice.

Theorem 3.1 produces bounds that can be zero even with (and ) (examples are provided throughout the paper), supporting the concept of one-shot learning. This is true even if the dataset is not drawn according to the measure . This is because, although such a dataset may incur a larger value of (than a usual i.i.d.-drawn dataset), it can decrease in the generalization bounds of . Furthermore, by being strongly instance-dependent on the learned model , Theorem 3.1 supports the concept of curriculum learning (Bengio et al., 2009a). This is because curriculum learning directly guides the learning to obtain a good model , which minimizes by its definition.

3.2 Additionally using statistical assumption and general bounds on

By additionally using the standard i.i.d. assumption, Proposition 3.2 provides a general bound on the star-discrepancy that appears in Theorem 3.1. It is a direct consequence of (Heinrich et al., 2001, Theorem 2).

Let be a set of i.i.d. random variables with values on and distribution . Then, there exists a positive constant such that for all and all , with probability at least ,

where with . Proposition 3.2 is not probabilistically vacuous in the sense that we can increase to obtain , at the cost of increasing the constant in the bound. Forcing still keeps constant without dependence on relevant variables such as and . This is because if is large enough such that , which depends only on the constants. Using Proposition 3.2, one can immediately provide a statistical bound via Theorem 3.1 over random . To see how such a result differs from that of statistical learning theory, consider the case of . That is, we are looking at classic training error. Whereas statistical learning theory applies a statistical assumption to the whole object , analytical learning theory first decomposes into and then applies the statistical assumption only to . This makes strongly instance-dependent even with the statistical assumption. For example, with and , if the training dataset satisfies the standard i.i.d. assumption, we have that with high probability,

(1)

where the term is strongly instance-dependent.

In Equation (1), it is unnecessary for to approach infinity in order for the generalization gap to go to zero. As an extreme example, if the variation of aligns with that of the true (i.e., is constant), we have that and the generalization gap becomes zero even with . This example illustrates the fact that Theorem 3.1 supports the concept of one-shot learning via the transfer of knowledge into the resulting model .
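
As an extreme toy illustration of this remark (with an artificial constant loss that is not taken from this paper), a function with zero variation has identical empirical and expected averages, so the generalization gap vanishes even for a dataset of size one.

    import numpy as np

    # Toy one-shot case: a constant loss surface has zero variation, so the
    # generalization gap is exactly zero even with a single data point.
    rng = np.random.default_rng(0)
    loss_of_model = lambda x: np.full_like(x, 0.25)   # constant loss (illustrative)
    x_one = rng.uniform(size=1)                       # "training set" of size 1
    x_many = rng.uniform(size=1_000_000)              # proxy for the true measure
    print(abs(loss_of_model(x_many).mean() - loss_of_model(x_one).mean()))   # 0.0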

For the purpose of the non-statistical decomposition of , instead of Theorem 3.1, we might be tempted to conduct a simpler decomposition with the Hölder inequality or its variants. However, such a simpler decomposition is dominated by a difference between the true measure and the empirical measure on an arbitrary set in high-dimensional space, which suffers from the curse of dimensionality. Indeed, the proof of Theorem 3.1 is devoted to reformulating via the equivalence in the measure and the variation before taking any inequality, so that we can avoid such an issue. That is, the star-discrepancy evaluates the difference in the measures on high-dimensional boxes with one vertex at the origin, instead of on an arbitrary set.

The following proposition proves the existence of a dataset with a convergence rate of that is asymptotically faster than in terms of the dataset size . This is a direct consequence of (Aistleitner and Dick, 2014, Theorem 2).

Assume that is a surjection. Let be any (non-negative) normalized Borel measure on . Then, for any , there exists a dataset such that

This can be of interest when we can choose to make small without increasing too much; i.e., it then provides a faster convergence rate than usual statistical guarantees. If (which is true in many practical cases), we can have by setting , because there exists a bijection between and . Then, although the variation of is unbounded in general, might still be small. For example, it is still zero if the variation of aligns with that of the true in this space of .
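
The following sketch illustrates this contrast in one dimension, where the star-discrepancy with respect to the uniform measure can be computed exactly: an i.i.d. uniform dataset has discrepancy decaying roughly at the statistical rate, whereas a carefully constructed dataset, here the base-2 van der Corput sequence used purely as an example, achieves a much smaller discrepancy at the same size. The construction and constants are illustrative only.

    import numpy as np

    def star_discrepancy_1d(t):
        """Exact star-discrepancy of points t in [0, 1] w.r.t. the uniform measure."""
        t = np.sort(np.asarray(t))
        m = len(t)
        i = np.arange(1, m + 1)
        return max(np.max(np.abs(i / m - t)), np.max(np.abs((i - 1) / m - t)))

    def van_der_corput(n, base=2):
        """First n terms of the van der Corput low-discrepancy sequence."""
        seq = []
        for k in range(1, n + 1):
            q, denom, x = k, 1.0, 0.0
            while q > 0:
                q, r = divmod(q, base)
                denom *= base
                x += r / denom
            seq.append(x)
        return seq

    rng = np.random.default_rng(0)
    for n in (100, 1000, 10000):
        print(f"n = {n:6d}   iid D* = {star_discrepancy_1d(rng.uniform(size=n)):.4f}"
              f"   van der Corput D* = {star_discrepancy_1d(van_der_corput(n)):.4f}")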

3.3 General examples

The following example provides insights on the quality of learned representations:

Let where is a map of any learned representation and is a variable such that there exists a function satisfying (for supervised learning, setting always satisfies this condition regardless of the information contained in ). For example, may represent the output of any intermediate hidden layer in deep learning (possibly the last hidden layer), and may encode the noise left in the label . Let be a map such that . Then, if , Theorem 3.1 implies that for any dataset pair (including ),

Example 3.3 partially supports the concept of disentanglement in deep learning (Bengio et al., 2009b) and proposes a new concrete method to measure the degree of disentanglement as follows. In the definition of , each term can be viewed as measuring how entangled the -th variables are in the space of a learned (hidden) representation. We can observe this from the definition of or from Proposition 2.1.2 as: , where is the -th order cross partial derivative across the -th variables. If all the variables in the space of a learned (hidden) representation are completely disentangled in this sense, for all and is minimized to . Additionally, Appendices A.5 and A.6 discuss the effect of flatness in measures and of higher-order derivatives.
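
The following sketch gives a rough numerical probe of this entanglement measure: it estimates the integral of the absolute cross partial derivative of a function of a two-dimensional representation by finite differences and Monte Carlo sampling, and compares an additive (disentangled) function with a multiplicative (entangled) one. The functions and names are illustrative only.

    import itertools
    import numpy as np

    def cross_term(g, i, j, d, n=20000, h=1e-3, seed=0):
        """Estimate of the integral over [0, 1]^d of |d^2 g / dz_i dz_j|."""
        z = np.random.default_rng(seed).uniform(h, 1 - h, size=(n, d))
        total = np.zeros(n)
        # central cross difference: g(++) - g(+-) - g(-+) + g(--), over 4 h^2
        for si, sj in itertools.product((-1, 1), repeat=2):
            shifted = z.copy()
            shifted[:, i] += si * h
            shifted[:, j] += sj * h
            total += si * sj * g(shifted)
        return np.mean(np.abs(total / (4 * h * h)))

    g_disentangled = lambda z: 0.5 * (z[:, 0] + z[:, 1])   # coordinates act additively
    g_entangled    = lambda z: z[:, 0] * z[:, 1]           # coordinates interact

    print(cross_term(g_disentangled, 0, 1, d=2))  # approximately 0: no cross interaction
    print(cross_term(g_entangled,    0, 1, d=2))  # approximately 1: coupled coordinates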

One of the reasons why analytical learning theory is complementary to statistical learning theory is that we can naturally combine the two. For example, in Example 3.3, we cannot directly adopt the probabilistic bound on from Section 3.2 if does not satisfy the i.i.d. assumption because depends on the whole dataset . In this case, to analyze , we can use approaches from statistical learning theory, such as Rademacher complexity or covering numbers. To see this, consider a set such that and is independent of . Then, by applying Proposition 3.2 with a union bound over a cover of , we can obtain probabilistic bounds on with the log of the covering number of for all representations . As in data-dependent approaches (e.g., Bartlett et al. 2017, Lemma A.9), one can also consider a sequence of sets such that , and one can obtain a data-dependent bound on via a complexity of .

The following example establishes the tightness of Theorem 3.1 (i) with the 0-1 loss in general, where is an inclusion map: Theorem 3.1 (i) is tight for multi-class classification with the 0-1 loss as follows. Let . Let be an identity map. Then, and for all . The pair of and satisfies the condition in Theorem 3.1 since and are measurable functions. Thus, from Theorem 3.1, (see Appendix B.5 for this derivation), which establishes the tightness of Theorem 3.1 (i) with the 0-1 loss: for any dataset pair (including ),

The following example applies Theorem 3.1 to a raw representation space and a loss space : Consider a normalized domain and a Borel measure on . For example, can be an unknown hidden generative space or an input-output space (). Let us apply Theorem 3.1 to this measure space with and . Then, if , Theorem 3.1 implies that for any dataset pair (including ) and any , .

Example 3.3 indicates that we can regularize in some space to control the generalization gap. For example, letting the model be invariant to a subspace that is not essential for prediction decreases the bound on . As an extreme example, if with some generative function and noise (i.e., a setting considered in an information theoretic approach), being invariant to results in a smaller bound on . This is qualitatively related to an information theoretic observation such as in (Achille and Soatto, 2017).

4 Application to linear regression

Even in the classical setting of linear regression, recent papers (Zhang et al. 2017, Section 5; Kawaguchi et al. 2017, Section 3; Poggio et al. 2017, Section 5) suggest the need for further theoretical studies to better understand precisely what makes a learned model generalize well, especially with an arbitrarily rich hypothesis space and algorithmic instability. Theorem 3.1 studies this question abstractly for machine learning in general. As a simple concrete example, this section considers linear regression. Note, however, that the theoretical results in this section can be directly applied to deep learning as described in Remark 4.2.

Let be a training dataset of the input-target pairs where . Let be the learned model at the end of any training process. For example, in empirical risk minimization, the matrix is an output of the training process, . Here, is any normalized measurable function, corresponding to fixed features. For any given variable , let be the dimensionality of the variable . The goal is to minimize the expected error of the learned model .
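
As a minimal concrete sketch of this setup (with an illustrative feature map and data-generating process that are not taken from this paper), the following code fits the weight matrix by empirical risk minimization (ordinary least squares) on top of a fixed polynomial feature map and reports the resulting training and test errors.

    import numpy as np

    rng = np.random.default_rng(0)

    def phi(x):
        """Fixed feature map: a small polynomial basis of the scalar input."""
        return np.stack([np.ones_like(x), x, x ** 2, x ** 3], axis=1)

    x_train = rng.uniform(size=50)
    y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=50)    # noisy targets

    W, *_ = np.linalg.lstsq(phi(x_train), y_train, rcond=None)           # ERM solution

    x_test = rng.uniform(size=10000)
    y_test = np.sin(2 * np.pi * x_test) + 0.1 * rng.normal(size=10000)
    train_err = np.mean((phi(x_train) @ W - y_train) ** 2)
    test_err = np.mean((phi(x_test) @ W - y_test) ** 2)
    print(f"training error = {train_err:.4f}   test error = {test_err:.4f}")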

4.1 Domains with linear Gaussian labels

In this subsection only, we assume that the target output is structured such that , where is a zero-mean random variable independent of . Many columns of can be zero (i.e., sparse) such that uses only a small portion of the feature vector . Thus, this label assumption can be satisfied by including a sufficient number of elements from a basis with uniform approximation power (e.g., a polynomial basis, a Fourier basis, a set of step functions, etc.) in the feature vector, up to a desired approximation error. Note that we do not assume any knowledge of .

Let be the (unknown) normalized measure for the input (corresponding to the marginal distribution of ). Let and be the input part and the (unknown) input-noise part of the same training dataset as , respectively. We do not assume access to . Let be the -th column of the matrix .

Assume that the labels are structured as described above and . Then, Theorem 3.1 implies that

(2)

where , , , and

Theorem 4.1 is tight in terms of both the minimizer and its value, as explained below. The bound in Theorem 4.1 (i.e., the right-hand side of Equation (2)) is minimized (to the noise term only) if and only if (see Appendix A.4 for pathological cases). Therefore, minimizing the bound in Theorem 4.1 is equivalent to minimizing the expected error or generalization error (see Appendix A.4 for further details). Furthermore, the bound in Theorem 4.1 holds with equality if . Thus, the bound is tight in terms of both the minimizer and its value.

For and , we can straightforwardly apply the probabilistic bounds under the standard i.i.d. statistical assumption. From Proposition 3.2, with high probability,