Machine Learning using the Variational Predictive Information Bottleneck with a Validation Set

11/06/2019 ∙ by Sayandev Mukherjee, et al.


1 Abstract

Zellner [1] modeled statistical inference in terms of information processing and postulated the Information Conservation Principle (ICP) between the input and output of the information processing block, showing that this yielded Bayesian inference as the optimum information processing rule. Recently, Alemi [2] reviewed Zellner’s work in the context of machine learning and showed that the ICP could be seen as a special case of a more general optimum information processing criterion, namely the predictive information bottleneck objective. However, [2] modeled machine learning as using training and test data sets only, and did not account for the use of a validation data set during training. The present note is an attempt to extend Alemi’s information processing formulation of machine learning, and the predictive information bottleneck objective for model training, to the widely-used scenario where training utilizes not only a training but also a validation data set.

2 Review of the Information Processing Formulation of Machine Learning

2.1 Introduction and Notation

We will use Alemi’s formulation and notation from [2], with some additional detail for clarity. Consider a data generating process with distribution $p(x)$ (a PMF or PDF, depending on whether $X$ is discrete or continuous-valued, respectively), which generates the features $x$ according to the distribution $p(x)$. We collect $N_P$ samples of $X$ in the training set $X_P = \{x_1, \dots, x_{N_P}\}$, with the choice of subscript ‘P’ emphasizing that these are past observations. Depending on whether we are testing the performance of a trained model on a test set or deploying a trained model in a production environment to perform inference, we may have a finite or (potentially) infinite set $X_F$ of future (i.e., not seen during training) samples of $X$ from the same process (also emphasized by the choice of subscript ‘F’).

2.2 The Predictive Information Bottleneck Objective for Model Training

Model training or “learning” is the extraction of the model parameters $\Theta$ from the training set $X_P$. Viewed from the perspective of information processing, we may see model training as computing, and sampling from, the distribution $p(\theta \mid x_P)$.

Again from the perspective of information processing, the trained model may be evaluated by how much information the training-derived representation $\Theta$ captures about the future samples $X_F$. In other words, we want to find the $p(\theta \mid x_P)$ that maximizes the mutual information $I(\Theta; X_F)$:

$I(\Theta; X_F) = D_{\mathrm{KL}}\big(p(\theta, x_F) \,\|\, p(\theta)\, p(x_F)\big),$

where for any two random variables $U$ and $V$, their mutual information $I(U; V)$ is defined as the Kullback-Leibler divergence between their joint distribution $p(u, v)$ and the product of their marginal distributions $p(u)$ and $p(v)$:

$I(U; V) = D_{\mathrm{KL}}\big(p(u, v) \,\|\, p(u)\, p(v)\big) = \left\langle \log \frac{p(u, v)}{p(u)\, p(v)} \right\rangle.$
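The Kullback-Leibler form of the mutual information can be checked numerically on a small discrete joint distribution; the 2x2 joint below is hypothetical, chosen only for illustration.

```python
import numpy as np

# I(U;V) = D_KL( p(u,v) || p(u) p(v) ) for a toy 2x2 discrete joint.
p_uv = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_u = p_uv.sum(axis=1, keepdims=True)   # marginal of U
p_v = p_uv.sum(axis=0, keepdims=True)   # marginal of V

# KL divergence between the joint and the product of marginals (in nats)
mi = np.sum(p_uv * np.log(p_uv / (p_u * p_v)))
```

The same value is obtained from the entropy identity $I(U;V) = H(U) + H(V) - H(U,V)$, which is a useful sanity check on the computation.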

Alemi [2] proposes to limit the complexity of the model representation obtained from training by incorporating a bottleneck requirement on the mutual information $I(\Theta; X_P)$. This yields the following predictive information bottleneck objective on training:

$\max_{p(\theta \mid x_P)} I(\Theta; X_F) \quad \text{subject to} \quad I(\Theta; X_P) = I_0.$    (1)

Applying a Lagrange multiplier $\beta$ to the constraint on $I(\Theta; X_P)$, (1) can be rewritten as the unconstrained optimization problem

$\max_{p(\theta \mid x_P)} I(\Theta; X_F) - \beta \big[ I(\Theta; X_P) - I_0 \big].$    (2)

Note that there is no constraint on the sign of $\beta$. Next, from the Markov property of the information processing chain $X_F \leftrightarrow X_P \leftrightarrow \Theta$, we have $I(\Theta; X_F \mid X_P) = 0$, so we have

$I(\Theta; X_F) = I(\Theta; X_P, X_F) - I(\Theta; X_P \mid X_F) = I(\Theta; X_P) - I(\Theta; X_P \mid X_F),$    (3)

where in the first step we have used the identity

$I(\Theta; X_P, X_F) = I(\Theta; X_F) + I(\Theta; X_P \mid X_F) = I(\Theta; X_P) + I(\Theta; X_F \mid X_P).$    (4)

Combining (2) and (3) then yields the following unconstrained optimization problem equivalent to the predictive information bottleneck objective (1):

$\min_{p(\theta \mid x_P)} I(\Theta; X_P \mid X_F) - \gamma\, I(\Theta; X_P),$    (5)

where $\gamma = 1 - \beta$. Since (5) cannot be solved directly, Alemi employs two variational approximations that are described next.
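The chain-rule identity (4) and the Markov property invoked above can be verified numerically on a toy chain of binary variables; all the distributions in the sketch below are hypothetical.

```python
import numpy as np

# Toy chain Theta <- X_P <-> X_F: theta and x_F are conditionally
# independent given x_P. All variables are binary.
p_xp = np.array([0.6, 0.4])                 # p(x_P)
p_th_given_xp = np.array([[0.9, 0.1],       # p(theta | x_P)
                          [0.2, 0.8]])
p_xf_given_xp = np.array([[0.7, 0.3],       # p(x_F | x_P)
                          [0.1, 0.9]])

# Joint p(theta, x_P, x_F) = p(x_P) p(theta|x_P) p(x_F|x_P)
joint = np.einsum('p,pt,pf->tpf', p_xp, p_th_given_xp, p_xf_given_xp)

def entropy(dist):
    d = dist[dist > 0]
    return -np.sum(d * np.log(d))

H_t = entropy(joint.sum(axis=(1, 2)))
H_p = entropy(joint.sum(axis=(0, 2)))
H_f = entropy(joint.sum(axis=(0, 1)))
H_tp = entropy(joint.sum(axis=2))
H_tf = entropy(joint.sum(axis=1))
H_pf = entropy(joint.sum(axis=0))
H_tpf = entropy(joint)

I_t_p = H_t + H_p - H_tp                      # I(Theta; X_P)
I_t_f = H_t + H_f - H_tf                      # I(Theta; X_F)
I_t_p_given_f = H_tf + H_pf - H_f - H_tpf     # I(Theta; X_P | X_F)
I_t_f_given_p = H_tp + H_pf - H_p - H_tpf     # I(Theta; X_F | X_P)

assert abs(I_t_f_given_p) < 1e-10                     # Markov: zero
assert abs(I_t_p - (I_t_f + I_t_p_given_f)) < 1e-10   # identities (3)-(4)
```

The final two assertions are exactly the Markov property and the decomposition used to pass from (2) to (5).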

2.3 Variational Approximations to the Predictive Information Bottleneck Objective

  • Treat $X_F$ as unobserved variables and use a variational approximation $q(\theta)$ to the true distribution $p(\theta \mid x_F)$, where $q(\theta)$ is a distribution chosen independent of $x_F$. Denoting by $\langle \cdot \rangle$ the expectation with respect to (the true distributions of) all the random variables $\Theta, X_P, X_F$, we have (recalling that $\Theta$ and $X_F$ are conditionally independent given $X_P$):

    $I(\Theta; X_P \mid X_F) = \left\langle \log \frac{p(\theta \mid x_P)}{p(\theta \mid x_F)} \right\rangle \le \left\langle \log \frac{p(\theta \mid x_P)}{q(\theta)} \right\rangle.$    (6)
  • Treat the distribution $p(x_P \mid \theta)$ as a “likelihood” function and use a variational approximation for it given by $q(x_P \mid \theta)$, for example the factorized form $q(x_P \mid \theta) = \prod_{i=1}^{N_P} q(x_i \mid \theta)$ for some selected distribution $q(x \mid \theta)$. Then we can write

    $I(\Theta; X_P) = \left\langle \log \frac{p(x_P \mid \theta)}{p(x_P)} \right\rangle \ge \left\langle \log q(x_P \mid \theta) \right\rangle + H(X_P),$    (7)

    where for any random variable $X$, $H(X) = -\langle \log p(x) \rangle$ is its entropy.

2.4 Variational Formulation of Predictive Information Bottleneck Objective

From (6) and (7) we therefore have, for every choice of $\gamma \ge 0$, variational approximate marginal distribution $q(\theta)$, and variational approximate likelihood function $q(x_P \mid \theta)$, the following upper bound on the objective function of (5):

$I(\Theta; X_P \mid X_F) - \gamma\, I(\Theta; X_P) \le \left\langle \log \frac{p(\theta \mid x_P)}{q(\theta)} - \gamma \log q(x_P \mid \theta) \right\rangle - \gamma\, H(X_P).$    (8)

Note that $H(X_P)$ is a constant outside our control. Since the exact problem (5) cannot be solved directly, we can simply select $\gamma$, $q(\theta)$, and $q(x_P \mid \theta)$ based on some external criteria and solve the following problem, whose optimum value yields an upper bound on the exact objective in (5):

$\min_{p(\theta \mid x_P)} \left\langle \log \frac{p(\theta \mid x_P)}{q(\theta)} - \gamma \log q(x_P \mid \theta) \right\rangle$    (9)

$= \min_{p(\theta \mid x_P)} \left\langle \log \frac{p(\theta \mid x_P)}{q_\gamma(\theta \mid x_P)} \right\rangle - \big\langle \log Z_\gamma(x_P) \big\rangle,$    (10)

where

$q_\gamma(\theta \mid x_P) = \frac{q(\theta)\, q(x_P \mid \theta)^\gamma}{Z_\gamma(x_P)}, \qquad Z_\gamma(x_P) = \int q(\theta)\, q(x_P \mid \theta)^\gamma \, d\theta,$

and we emphasize that the expectation in (9) is with respect to the true distributions of $\Theta$ and $X_P$. Note that the use of the variational approximation $q(\theta)$ has eliminated the dependence on the distribution of $X_F$.

Finally, it follows from (10) that the optimum distribution $p^*(\theta \mid x_P)$ is given by

$p^*(\theta \mid x_P) = q_\gamma(\theta \mid x_P) \propto q(\theta)\, q(x_P \mid \theta)^\gamma.$

For $\gamma = 1$, the objective in (10) can be identified as the ICP postulated by Zellner [1], and the optimum is the Bayesian inference derived from the variational marginal and likelihood $q(\theta)$ and $q(x_P \mid \theta)$ respectively:

$p^*(\theta \mid x_P) = \frac{q(\theta)\, q(x_P \mid \theta)}{\int q(\theta')\, q(x_P \mid \theta')\, d\theta'}.$
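One way to picture the optimum just described is as a tempered (“power”) posterior. The sketch below assumes the optimum takes the form $q(\theta)\, q(x_P \mid \theta)^\gamma$ on a discretized Bernoulli parameter grid; the grid, the uniform variational marginal, and the toy data are all hypothetical choices for illustration only.

```python
import numpy as np

# Power posterior q(theta) q(x_P|theta)^gamma on a discretized grid,
# with a Bernoulli variational likelihood. gamma = 1 reduces to Bayes.
theta = np.linspace(0.01, 0.99, 99)              # Bernoulli parameter grid
q_theta = np.full_like(theta, 1 / len(theta))    # uniform variational marginal
x_P = np.array([1, 1, 0, 1, 1, 0, 1, 1])         # toy training set

def power_posterior(gamma):
    log_lik = (x_P.sum() * np.log(theta)
               + (len(x_P) - x_P.sum()) * np.log(1 - theta))
    w = q_theta * np.exp(gamma * log_lik)        # q(theta) q(x_P|theta)^gamma
    return w / w.sum()                           # normalize by Z_gamma(x_P)

bayes = power_posterior(1.0)      # gamma = 1: ordinary Bayesian posterior
tempered = power_posterior(0.5)   # gamma < 1: flatter, less data-driven

# gamma = 1 matches Bayes' rule applied directly
direct = q_theta * theta**x_P.sum() * (1 - theta)**(len(x_P) - x_P.sum())
direct /= direct.sum()
assert np.allclose(bayes, direct)
```

Lowering $\gamma$ below 1 down-weights the likelihood, giving a broader distribution over the grid, consistent with the role of $\gamma$ as tempering the influence of the training data.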

3 The Modified Predictive Information Bottleneck Objective

3.1 Shortcomings of the Predictive Information Bottleneck Objective (1)

Alemi’s formulation [2] of the predictive information bottleneck objective in (1) has the following shortcomings:

  • Recall that in (1), the Lagrange multiplier $\beta$ may be positive or negative-valued. However, all the interpretations in [2, Table 1] of the equivalent optimization problem (5), which is written in terms of $\gamma = 1 - \beta$, are for $\gamma \ge 0$, i.e., $\beta \le 1$, though there is no explanation offered in [2] of why only values of $\gamma$ in this range make sense for this problem.

  • The quantity $I_0$ is never used after the equality constraint $I(\Theta; X_P) = I_0$ is incorporated into the problem formulation in (1). Further, $\beta$ is simply selected as per other criteria, and not set so as to achieve this equality $I(\Theta; X_P) = I_0$.

3.2 Predictive Information Bottleneck Objective with Inequality Constraint

The fundamental reason for the above shortcomings is the requirement of equality in the bottleneck constraint.

We therefore propose to modify the predictive information bottleneck objective from (1) to one that has an inequality constraint between $I(\Theta; X_P)$, the mutual information of the trained model and the training set, and a pre-selected threshold $I_0$:

$\max_{p(\theta \mid x_P)} I(\Theta; X_F) \quad \text{subject to} \quad I(\Theta; X_P) \ge I_0.$    (11)

Note that this formulation imposes a bottleneck on model performance, such that only models that extract a certain threshold level of information from the training set clear the bottleneck. From the Karush-Kuhn-Tucker (KKT) theorem, the optimization problem (1) now changes to

$\max_{p(\theta \mid x_P)} I(\Theta; X_F) + \lambda \big[ I(\Theta; X_P) - I_0 \big], \quad \lambda \ge 0$    (12)

$\equiv \min_{p(\theta \mid x_P)} I(\Theta; X_P \mid X_F) - \gamma\, I(\Theta; X_P),$    (13)

where we have used (3) to go from (12) to (13), and $\gamma = 1 + \lambda \ge 1$ because $\lambda \ge 0$ by definition. Note that this immediately resolves the interpretability issues (except for the uninteresting case $\gamma = 0$, i.e., ignoring the data altogether while training; see [2, Table 1]) that arose with the equality bottleneck constraint formulation above. Note also that the objective function of (13) is the same as the left hand side of (8). Then the variational approximations discussed in Sec. 2.3 can be applied as before, resulting in the same variational objective function (10).

Further, the KKT theorem requires that the optimum $\lambda^*$ for (12) satisfy the complementary slackness condition

$\lambda^* \big[ I(\Theta; X_P) - I_0 \big] = 0,$    (14)

which implies that if the training step achieves $I(\Theta; X_P) > I_0$, then the optimum $\lambda^*$ is zero, i.e., the optimum $\gamma = 1 + \lambda^*$ in (13) is unity. From the discussion at the end of Sec. 2.4 we have the following result.

Theorem 1

If the training step is successful, i.e., we achieve $I(\Theta; X_P) > I_0$, then the optimum information processing for (10) is Bayesian inference:

$p^*(\theta \mid x_P) = \frac{q(\theta)\, q(x_P \mid \theta)}{\int q(\theta')\, q(x_P \mid \theta')\, d\theta'},$    (15)

where $q(\theta)$ and $q(x_P \mid \theta)$ are as defined in Sec. 2.3.

4 Extension to Model Training with a Validation Set

4.1 Introduction and Notation

The machine learning model is trained on a training set $X_P$. For training classical machine learning models like support vector machines, decision trees, and random forests, which are not as computationally expensive to train as deep learning models, this training set $X_P$ can be used for cross-validation to validate the performance of the model during training in order to guard against overfitting the training set. However, $K$-fold cross-validation requires the model to be trained from scratch $K$ times, which often imposes an unacceptable computational burden when the model is a deep learning one. Thus, when training deep learning models, it is widespread practice to set aside a portion of the available training data as a validation set, use the remainder of the training data as the training set to train the model, and, as the model trains, validate its performance on the validation set.

In the notation of the previous section, if the available training data comprises the samples $\{x_1, \dots, x_N\}$, the training set is now $X_P = \{x_1, \dots, x_{N_P}\}$ where $N_P < N$, with the rest of the training data comprising the validation set $X_V = \{x_{N_P+1}, \dots, x_N\}$.
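A minimal sketch of this hold-out split follows; the sizes $N$ and $N_P$ and the synthetic feature vectors are hypothetical stand-ins.

```python
import numpy as np

# Hold-out split: keep the first N_P of the N available samples as the
# training set X_P and the remainder as the validation set X_V.
rng = np.random.default_rng(0)
N, N_P = 1000, 800
data = rng.normal(size=(N, 4))    # stand-in for the samples x_1, ..., x_N

X_P = data[:N_P]                  # training set {x_1, ..., x_{N_P}}
X_V = data[N_P:]                  # validation set {x_{N_P+1}, ..., x_N}

assert len(X_P) + len(X_V) == N
```

In practice the data would be shuffled before splitting so that $X_P$ and $X_V$ are drawn from the same process, matching the exchangeability assumed throughout.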

4.2 Predictive Information Bottleneck Objective with Validation Set

As in Sec. 3.2, we will require the trained model to clear the bottleneck of a threshold $I_0$ on the mutual information between the model and the training set:

$I(\Theta; X_P) \ge I_0.$    (16)

In addition, we will require the model, after being trained on the training set $X_P$, to clear another threshold $I_1$ of mutual information between the model and the validation set $X_V$ (analogous to the performance requirement on the mutual information between the model and the test set $X_F$):

$I(\Theta; X_V \mid X_P) \ge I_1.$    (17)

From (4) we see that (16) and (17) together yield

$I(\Theta; X_P, X_V) = I(\Theta; X_P) + I(\Theta; X_V \mid X_P) \ge I_0 + I_1,$    (18)

which is equivalent to requiring the bottleneck condition with threshold $I_0 + I_1$ on the mutual information between the model and the augmented training set created from $X_P$ and $X_V$. However, instead of requiring the single condition (18), we shall impose the stricter requirement to satisfy the two conditions (16) and (17).

The new optimization problem is now

$\max_{p(\theta \mid x_P, x_V)} I(\Theta; X_F) \quad \text{subject to} \quad I(\Theta; X_P) \ge I_0 \ \text{and} \ I(\Theta; X_V \mid X_P) \ge I_1.$    (19)

In (19) the distribution $p(\theta \mid x_P, x_V)$ incorporates the dependence of the trained model on the validation set. This dependence arises because we train by running a learning algorithm on (a subset of) the training set, validate the (partially) trained model on the validation set, run the learning algorithm again on (another subset of) the training set, validate the (partially) trained model again on the validation set, and so on, until the performance of the model on both the training and validation sets satisfies some criteria, which we represent here by (16) and (17) respectively.
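The alternation just described can be sketched as a training loop with validation-based stopping. In the sketch, `train_step` and `score` are hypothetical stand-ins for the learning algorithm and for the stopping criteria; they are deliberately simple placeholders, not the mutual-information conditions (16)-(17) themselves.

```python
import random

# Schematic train/validate alternation with early stopping.
random.seed(0)

X_P = [random.gauss(1.0, 0.5) for _ in range(80)]   # training set
X_V = [random.gauss(1.0, 0.5) for _ in range(20)]   # validation set

def train_step(theta, batch):
    # placeholder learning step: nudge theta toward the batch mean
    return theta + 0.1 * (sum(batch) / len(batch) - theta)

def score(theta, data):
    # placeholder performance measure (higher is better)
    return -abs(sum(data) / len(data) - theta)

theta = 0.0
best_val, best_theta, patience = float("-inf"), None, 0
for epoch in range(200):
    for i in range(0, len(X_P), 16):       # run the learner on subsets of X_P
        theta = train_step(theta, X_P[i:i + 16])
    val = score(theta, X_V)                # then validate on X_V
    if val > best_val:
        best_val, best_theta, patience = val, theta, 0
    else:
        patience += 1
        if patience >= 10:                 # stop once validation stalls
            break
```

The final model thus depends on both $X_P$ (through the gradient-like steps) and $X_V$ (through when training stops and which iterate is kept), which is exactly why the trained-model distribution must be written as $p(\theta \mid x_P, x_V)$.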

4.3 Modified Predictive Information Bottleneck Objective with Inequality Constraint and Validation Set

From the Markov property of the information processing chain $X_F \leftrightarrow (X_P, X_V) \leftrightarrow \Theta$, we have $I(\Theta; X_F \mid X_P, X_V) = 0$, so we have

$I(\Theta; X_F) = I(\Theta; X_P, X_V) - I(\Theta; X_P, X_V \mid X_F) = I(\Theta; X_P) + I(\Theta; X_V \mid X_P) - I(\Theta; X_P, X_V \mid X_F).$    (20)

As with (11), the KKT theorem lets us rewrite (19) as follows:

$\max_{p(\theta \mid x_P, x_V)} I(\Theta; X_F) + \lambda_P \big[ I(\Theta; X_P) - I_0 \big] + \lambda_V \big[ I(\Theta; X_V \mid X_P) - I_1 \big], \quad \lambda_P, \lambda_V \ge 0$    (21)

$\equiv \min_{p(\theta \mid x_P, x_V)} I(\Theta; X_P, X_V \mid X_F) - \gamma_P\, I(\Theta; X_P) - \gamma_V\, I(\Theta; X_V \mid X_P),$    (22)

where we use (20) to go from (21) to (22), and $\gamma_P = 1 + \lambda_P \ge 1$, $\gamma_V = 1 + \lambda_V \ge 1$. Further, the optimum $(\lambda_P^*, \lambda_V^*)$ must satisfy the complementary slackness conditions

$\lambda_P^* \big[ I(\Theta; X_P) - I_0 \big] = 0,$    (23)

$\lambda_V^* \big[ I(\Theta; X_V \mid X_P) - I_1 \big] = 0.$    (24)

4.4 Variational Approximations

  • Treat $X_F$ as unobserved variables and use a variational approximation $q(\theta)$ to the true distribution $p(\theta \mid x_F)$, where $q(\theta)$ is a distribution chosen independent of $x_F$. Following the notation of Sec. 2.3 and the same steps as in the derivation of (6), we obtain

    $I(\Theta; X_P, X_V \mid X_F) \le \left\langle \log \frac{p(\theta \mid x_P, x_V)}{q(\theta)} \right\rangle.$    (25)
  • Treat the distribution $p(x_P \mid \theta)$ as a “likelihood” function and use a variational approximation for it given by $q(x_P \mid \theta)$, for example the factorized form $q(x_P \mid \theta) = \prod_{i=1}^{N_P} q(x_i \mid \theta)$ for some selected distribution $q(x \mid \theta)$. Then we obtain (7) again:

    $I(\Theta; X_P) \ge \left\langle \log q(x_P \mid \theta) \right\rangle + H(X_P).$

  • Treat the conditional distribution $p(x_V \mid \theta, x_P)$ as a conditional likelihood function and use a variational approximation for it given by $q(x_V \mid \theta, x_P)$, for example the factorized form $q(x_V \mid \theta, x_P) = \prod_{i=N_P+1}^{N} q(x_i \mid \theta, x_P)$ for some selected distribution $q(x \mid \theta, x_P)$. Then we obtain

    $I(\Theta; X_V \mid X_P) \ge \left\langle \log q(x_V \mid \theta, x_P) \right\rangle + H(X_V \mid X_P),$    (26)

    where for any two random variables $X$ and $Y$, $H(X \mid Y) = -\langle \log p(x \mid y) \rangle$ is the conditional entropy of $X$ given $Y$.

    Note that the process of training on $X_P$ along with periodic validation using the validation set $X_V$ induces a dependence between $X_P$ and $X_V$ through the model $\Theta$, hence it is not true in general that $p(x_V \mid \theta, x_P) = p(x_V \mid \theta)$ even if $X_P$ and $X_V$ are assumed independent.

4.5 Variational Formulation of the Predictive Information Bottleneck Objective with Inequality Constraint and a Validation Set

From (25), (7), and (26), the objective function in (22) is upper-bounded for each choice of $\gamma_P, \gamma_V \ge 0$ as follows:

$I(\Theta; X_P, X_V \mid X_F) - \gamma_P\, I(\Theta; X_P) - \gamma_V\, I(\Theta; X_V \mid X_P) \le \left\langle \log \frac{p(\theta \mid x_P, x_V)}{q(\theta)} - \gamma_P \log q(x_P \mid \theta) - \gamma_V \log q(x_V \mid \theta, x_P) \right\rangle - \gamma_P\, H(X_P) - \gamma_V\, H(X_V) + \gamma_V\, I(X_P; X_V),$    (27)

where in the final step we use the identity that for any two random variables $X$ and $Y$, $H(X \mid Y) = H(X) - I(X; Y)$.

We treat the last three terms on the right hand side of (27) as constants outside our control. Since the exact problem (22) cannot be solved directly, we simply select $\gamma_P$, $\gamma_V$, $q(\theta)$, $q(x_P \mid \theta)$, and $q(x_V \mid \theta, x_P)$ based on some external criteria and solve the following problem, whose optimum value yields an upper bound on the exact objective in (22):

$\min_{p(\theta \mid x_P, x_V)} \left\langle \log \frac{p(\theta \mid x_P, x_V)}{q(\theta)} - \gamma_P \log q(x_P \mid \theta) - \gamma_V \log q(x_V \mid \theta, x_P) \right\rangle$    (28)

$= \min_{p(\theta \mid x_P, x_V)} \left\langle \log \frac{p(\theta \mid x_P, x_V)}{q_{\gamma_P, \gamma_V}(\theta \mid x_P, x_V)} \right\rangle - \big\langle \log Z_{\gamma_P, \gamma_V}(x_P, x_V) \big\rangle,$    (29)

where

$q_{\gamma_P, \gamma_V}(\theta \mid x_P, x_V) = \frac{q(\theta)\, q(x_P \mid \theta)^{\gamma_P}\, q(x_V \mid \theta, x_P)^{\gamma_V}}{Z_{\gamma_P, \gamma_V}(x_P, x_V)}$

with normalization

$Z_{\gamma_P, \gamma_V}(x_P, x_V) = \int q(\theta)\, q(x_P \mid \theta)^{\gamma_P}\, q(x_V \mid \theta, x_P)^{\gamma_V}\, d\theta,$

and we emphasize that the expectation in (28) is over the true distributions of $\Theta$, $X_P$, $X_V$. Note that the use of the variational approximation $q(\theta)$ has eliminated the dependence on the distribution of $X_F$.

It follows from (29) that the optimum distribution $p^*(\theta \mid x_P, x_V)$ is given by

$p^*(\theta \mid x_P, x_V) = q_{\gamma_P, \gamma_V}(\theta \mid x_P, x_V) \propto q(\theta)\, q(x_P \mid \theta)^{\gamma_P}\, q(x_V \mid \theta, x_P)^{\gamma_V}.$

Further, from (23) and (24), we can prove the following result in the same way as Theorem 1.

Theorem 2

If training with the validation set is successful, i.e., we achieve $I(\Theta; X_P) > I_0$ and $I(\Theta; X_V \mid X_P) > I_1$, then the optimum information processing for (29) is Bayesian inference:

$p^*(\theta \mid x_P, x_V) = \frac{q(\theta)\, q(x_P \mid \theta)\, q(x_V \mid \theta, x_P)}{\int q(\theta')\, q(x_P \mid \theta')\, q(x_V \mid \theta', x_P)\, d\theta'},$    (30)

where $q(\theta)$, $q(x_P \mid \theta)$, and $q(x_V \mid \theta, x_P)$ are as defined in Sec. 4.4.
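As a toy numerical sketch of the Theorem 2 optimum, assume the prior is multiplied by both likelihood factors with unit exponents, and additionally (purely for illustration, and contrary to the general caveat in Sec. 4.4) that the variational conditional likelihood does not depend on $x_P$. Under those hypothetical choices, Bayesian updating on $X_P$ and then on $X_V$ coincides with Bayesian updating on the pooled data.

```python
import numpy as np

# Theorem-2-style posterior q(th) q(X_P|th) q(X_V|th) on a parameter grid,
# with Bernoulli likelihoods and a uniform variational marginal.
theta = np.linspace(0.01, 0.99, 99)
q_theta = np.full_like(theta, 1 / len(theta))
X_P = np.array([1, 0, 1, 1])     # toy training set
X_V = np.array([1, 1, 0])        # toy validation set

def bern_lik(x):
    # Bernoulli likelihood of the sample x at every grid point
    return theta**x.sum() * (1 - theta)**(len(x) - x.sum())

post = q_theta * bern_lik(X_P) * bern_lik(X_V)   # sequential update
post /= post.sum()

pooled = q_theta * bern_lik(np.concatenate([X_P, X_V]))  # pooled update
pooled /= pooled.sum()
assert np.allclose(post, pooled)
```

This mirrors the interpretation of (30): when both bottleneck constraints are slack, the optimum is ordinary Bayesian inference on the augmented training set $X_P \cup X_V$.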

References

[1] A. Zellner. Optimal information processing and Bayes's theorem. The American Statistician, 42(4):278-280, 1988.

[2] A. A. Alemi. Variational predictive information bottleneck. arXiv preprint arXiv:1910.10831, 2019.