Zellner [1] modeled statistical inference in terms of information processing and postulated the Information Conservation Principle (ICP) between the input and output of the information processing block, showing that this yielded Bayesian inference as the optimum information processing rule. Recently, Alemi [2] reviewed Zellner's work in the context of machine learning and showed that the ICP could be seen as a special case of a more general optimum information processing criterion, namely the predictive information bottleneck objective. However, [2] modeled machine learning as using training and test data sets only, and did not account for the use of a validation data set during training. The present note is an attempt to extend Alemi's information processing formulation of machine learning, and the predictive information bottleneck objective for model training, to the widely used scenario where training utilizes not only a training set but also a validation data set.
2 Review of the Information Processing Formulation of Machine Learning
2.1 Introduction and Notation
We will use Alemi's formulation and notation from [2], with some additional detail for clarity. Consider a data-generating process that generates the features $x$ according to a distribution $p(x)$ (a PMF or PDF, depending on whether $x$ is discrete or continuous-valued, respectively). We collect samples of $x$ in the training set $x_P$, with the choice of subscript 'P' emphasizing that these are past observations. Depending on whether we are testing the performance of a trained model on a test set or deploying a trained model in a production environment to perform inference, we may have a finite or (potentially) infinite set $x_F$ of future (i.e., not seen during training) samples of $x$ from the same process (also emphasized by the choice of subscript 'F').
2.2 The Predictive Information Bottleneck Objective for Model Training
Model training or "learning" is the extraction of the model parameters $\theta$ from the training set $x_P$. Viewed from the perspective of information processing, we may see model training as computing, and sampling from, the distribution $p(\theta \mid x_P)$.
Again from the perspective of information processing, the trained model may be evaluated by how much information the training-derived representation $\theta$ captures about future samples $x_F$. In other words, we want to find the $p(\theta \mid x_P)$ that maximizes the mutual information $I(\theta; x_F)$:
where for any two random variables $X$ and $Y$, their mutual information $I(X; Y)$ is defined as the Kullback-Leibler distance between their joint distribution $p(x, y)$ and the product of their marginal distributions $p(x)$ and $p(y)$:
$$
I(X; Y) = \mathrm{KL}\big( p(x, y) \,\|\, p(x)\, p(y) \big) = \mathbb{E}_{p(x, y)}\left[ \log \frac{p(x, y)}{p(x)\, p(y)} \right].
$$
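As a small numerical illustration of this definition, the following sketch (using a hypothetical joint PMF chosen only for illustration, not taken from the formulation above) computes $I(X;Y)$ directly as the Kullback-Leibler distance between the joint distribution and the product of the marginals:

```python
import math

# Hypothetical joint PMF of a dependent binary pair (X, Y)
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions p(x) and p(y)
px = {x: sum(v for (a, b), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (a, b), v in p.items() if b == y) for y in (0, 1)}

# I(X;Y) = sum_{x,y} p(x,y) * log[ p(x,y) / (p(x) p(y)) ]  (in nats)
mi = sum(v * math.log(v / (px[x] * py[y])) for (x, y), v in p.items())
```

The result is nonnegative, and equals zero exactly when the joint factorizes into the product of the marginals.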
Alemi [2] proposes to limit the complexity of the model representation $\theta$ obtained from training by incorporating a bottleneck requirement on the mutual information $I(\theta; x_P)$. This yields the following predictive information bottleneck objective on training:
Applying a Lagrange multiplier $\beta$ to the constraint on $I(\theta; x_P)$, (1) can be rewritten as the unconstrained optimization problem
Note that there is no constraint on the sign of $\beta$. Next, from the Markov property of the information processing chain $x_F \leftrightarrow x_P \leftrightarrow \theta$, we have $I(\theta; x_F \mid x_P) = 0$. Combined with the chain-rule identity
$$
I(\theta; x_P, x_F) = I(\theta; x_P) + I(\theta; x_F \mid x_P) = I(\theta; x_F) + I(\theta; x_P \mid x_F),
$$
this gives $I(\theta; x_F) = I(\theta; x_P) - I(\theta; x_P \mid x_F)$, which can be substituted into the unconstrained objective. Since the resulting problem (5) cannot be solved directly, Alemi employs two variational approximations that are described next.
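The Markov-chain step above can be checked numerically. The sketch below (with toy distributions chosen only for illustration) builds a joint distribution satisfying the Markov factorization $p(x_F)\,p(x_P \mid x_F)\,p(\theta \mid x_P)$ and verifies that $I(\theta; x_P) = I(\theta; x_F) + I(\theta; x_P \mid x_F)$, which in particular implies the data-processing inequality $I(\theta; x_F) \le I(\theta; x_P)$:

```python
import itertools
import math

p_xf = {0: 0.6, 1: 0.4}                      # p(x_F)
p_xp_given_xf = {0: {0: 0.8, 1: 0.2},        # p(x_P | x_F): correlated with x_F
                 1: {0: 0.3, 1: 0.7}}
p_th_given_xp = {0: {0: 0.9, 1: 0.1},        # p(theta | x_P): depends on x_P only
                 1: {0: 0.2, 1: 0.8}}

# Full joint p(x_F, x_P, theta) under the Markov factorization
joint = {(f, xp, t): p_xf[f] * p_xp_given_xf[f][xp] * p_th_given_xp[xp][t]
         for f, xp, t in itertools.product((0, 1), repeat=3)}

def mi(pairs):
    """Mutual information (nats) of a joint PMF over (a, b) pairs."""
    pa, pb = {}, {}
    for (a, b), v in pairs.items():
        pa[a] = pa.get(a, 0.0) + v
        pb[b] = pb.get(b, 0.0) + v
    return sum(v * math.log(v / (pa[a] * pb[b]))
               for (a, b), v in pairs.items() if v > 0)

# Pair marginals (theta, x_P) and (theta, x_F)
p_t_xp, p_t_xf = {}, {}
for (f, xp, t), v in joint.items():
    p_t_xp[(t, xp)] = p_t_xp.get((t, xp), 0.0) + v
    p_t_xf[(t, f)] = p_t_xf.get((t, f), 0.0) + v

i_t_xp = mi(p_t_xp)
i_t_xf = mi(p_t_xf)

# Conditional MI: I(theta; x_P | x_F) = sum_f p(f) * I(theta; x_P | x_F = f)
i_t_xp_given_xf = 0.0
for f in (0, 1):
    cond = {(t, xp): joint[(f, xp, t)] / p_xf[f]
            for xp in (0, 1) for t in (0, 1)}
    i_t_xp_given_xf += p_xf[f] * mi(cond)
```

With these (or any) Markov-consistent distributions, `i_t_xp` equals `i_t_xf + i_t_xp_given_xf` up to floating-point error.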
2.3 Variational Approximations to the Predictive Information Bottleneck Objective
Treat $x_F$ as unobserved variables and use a variational approximation $q(\theta)$ to the true distribution $p(\theta \mid x_F)$, where $q(\theta)$ is a distribution chosen independent of $x_F$. Denoting by $\mathbb{E}$ the expectation with respect to (the true distributions of) all the random variables $(\theta, x_P, x_F)$, we have (recalling that $\theta$ and $x_F$ are conditionally independent given $x_P$):
Treat the distribution $p(x_F \mid \theta)$ as a "likelihood" function and use a variational approximation for it given by $q(x_F \mid \theta)$, for example the factorized form $q(x_F \mid \theta) = \prod_i q\big(x_F^{(i)} \mid \theta\big)$ for some selected distribution $q(\cdot \mid \theta)$. Then we can write
where for any random variable $X$, $H(X)$ is its entropy:
$$
H(X) = -\mathbb{E}_{p(x)}\big[ \log p(x) \big].
$$
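A quick sanity check of the entropy definition, together with the standard decomposition $I(X;Y) = H(X) + H(Y) - H(X,Y)$ relating it to the mutual information used above, on a hypothetical binary joint PMF:

```python
import math

# Hypothetical joint PMF of a dependent binary pair (X, Y)
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(pmf):
    """H = -E[log p], in nats, over the values of a PMF."""
    return -sum(v * math.log(v) for v in pmf.values() if v > 0)

# Marginals of this joint are both uniform over {0, 1}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}

h_xy = entropy(p)
mi = entropy(px) + entropy(py) - h_xy   # I(X;Y) = H(X) + H(Y) - H(X,Y)
```

The joint entropy never exceeds the sum of the marginal entropies, with equality exactly when $X$ and $Y$ are independent.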
2.4 Variational Formulation of Predictive Information Bottleneck Objective
From (6) and (7) we therefore have, for every choice of $\beta$, variational approximate marginal distribution $q(\theta)$, and variational approximate likelihood function $q(x_F \mid \theta)$, the following upper bound on the objective function of (5):
Note that $H(x_F)$ is a constant outside our control. Since the exact problem (5) cannot be solved directly, we can simply select $\beta$, $q(\theta)$, and $q(x_F \mid \theta)$ based on some external criteria and solve the following problem, whose optimum value yields an upper bound on the exact objective in (5):
and we emphasize that the expectation in (9) is with respect to the true distributions of $(\theta, x_P, x_F)$. Note that the use of the variational approximation $q(x_F \mid \theta)$ has eliminated the dependence on the true likelihood $p(x_F \mid \theta)$.
Finally, it follows from (10) that the optimum distribution $p^*(\theta \mid x_P)$ is given by
3 The Modified Predictive Information Bottleneck Objective
3.1 Shortcomings of the Predictive Information Bottleneck Objective (1)
Recall that in (1), the Lagrange multiplier $\beta$ may be positive or negative-valued. However, all the interpretations in [2, Table 1] of the equivalent optimization problem (5), which is written in terms of $\beta$, are for $\beta \geq 0$, though there is no explanation offered in [2] of why only values of $\beta$ in this range make sense for this problem.
The threshold value in the equality constraint is never used after the constraint is incorporated into the problem formulation in (1). Further, $\beta$ is simply selected as per other criteria, and not set so as to achieve this equality.
3.2 Predictive Information Bottleneck Objective with Inequality Constraint
The fundamental reason for the above shortcomings is the requirement of equality in the bottleneck constraint.
We therefore propose to modify the predictive information bottleneck objective from (1) to one that has an inequality constraint between $I(\theta; x_P)$, the mutual information of the trained model and the training set, and a pre-selected threshold $I_0$:
Note that this formulation imposes a bottleneck on model performance, such that only models that extract a certain threshold level of information from the training set clear the bottleneck. From the Karush-Kuhn-Tucker (KKT) theorem, the optimization problem (1) now changes to
where we have used (3) to go from (12) to (13), and $\beta \geq 0$ by definition of a KKT multiplier. Note that this immediately resolves the interpretability issues (except for the uninteresting case where $I(\theta; x_P) = 0$, i.e., we ignore the data altogether while training; see [2, Table 1]) that arose with the equality bottleneck constraint formulation above. Note also that the objective function of (13) is the same as the left-hand side of (8). Then the variational approximations discussed in Sec. 2.3 can be applied as before, resulting in the same variational objective function (10).
Further, the KKT theorem requires that the optimum $\beta^*$ for (12) satisfy the complementary slackness condition
$$
\beta^* \big( I(\theta; x_P) - I_0 \big) = 0 .
$$
4 Extension to Model Training with a Validation Set
4.1 Introduction and Notation
The machine learning model is trained on a training set $x_P$. For training classical machine learning models like support vector machines, decision trees, and random forests, which are not as computationally expensive to train as deep learning models, this training set can be used for $k$-fold cross-validation to validate the performance of the model during training in order to guard against overfitting the training set. However, $k$-fold cross-validation requires the model to be trained from scratch $k$ times, which often imposes an unacceptable computational burden when the model is a deep learning one. Thus, when training deep learning models, it is widespread practice to set aside a portion of the available training data as a validation set, use the remainder of the training data as the training set to train the model, and, as the model trains, validate its performance on the validation set.
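The hold-out practice described above can be sketched as follows; the function name, split fraction, and use of a fixed seed are illustrative choices, not part of the formulation:

```python
import random

def holdout_split(data, val_fraction=0.2, seed=0):
    """Shuffle `data` and split it into (training, validation) lists."""
    items = list(data)
    random.Random(seed).shuffle(items)   # fixed seed for reproducibility
    n_val = int(len(items) * val_fraction)
    return items[n_val:], items[:n_val]

# Reserve 20% of 100 hypothetical samples as the validation set
train, val = holdout_split(range(100), val_fraction=0.2)
```

Unlike $k$-fold cross-validation, this splits the data once, so the model is trained from scratch only once.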
In the notation of the previous section, if the available training data comprises the samples $x^{(1)}, \dots, x^{(n)}$, the training set is now $x_P = \{x^{(1)}, \dots, x^{(m)}\}$, where $m < n$, with the rest of the training data comprising the validation set $x_V = \{x^{(m+1)}, \dots, x^{(n)}\}$.
4.2 Predictive Information Bottleneck Objective with Validation Set
As in Sec. 3.2, we will require the trained model to clear the bottleneck of a threshold $I_P$ on the mutual information $I(\theta; x_P)$ between the model and the training set:
In addition, we will require the model, after being trained on the training set $x_P$, to clear another threshold $I_V$ of mutual information $I(\theta; x_V)$ between the model and the validation set $x_V$ (analogous to the performance requirement on mutual information between the model and the test set $x_F$):
which is equivalent to requiring a bottleneck condition with an appropriate threshold on the mutual information between the model and the augmented training set created from $x_P$ and $x_V$. However, instead of requiring the single condition (18), we shall impose the stricter requirement to satisfy the two conditions (16) and (17) separately.
The new optimization problem is now
In (19) the distribution $p(\theta \mid x_P, x_V)$ incorporates the dependence of the trained model on the validation set. This dependence arises because we train by running a learning algorithm on (a subset of) the training set, validate the (partially) trained model on the validation set, run the learning algorithm again on (another subset of) the training set, validate the (partially) trained model again on the validation set, and so on, until the performance of the model on both the training and validation sets satisfies some criteria, which we represent here by (16) and (17) respectively.
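The alternating train/validate procedure just described can be sketched as follows. The `train_step` and `validate` functions below are hypothetical stand-ins for a real learning algorithm and validation metric; the point is only the control flow (here, early stopping), through which the returned parameters come to depend on the validation set as well as the training set:

```python
def fit(train_set, val_set, n_epochs=50, patience=3):
    """Alternate training steps on x_P with validation checks on x_V."""
    model = {"weight": 0.0}               # toy stand-in for the parameters theta
    best_score, best_model, stalls = float("-inf"), dict(model), 0
    for epoch in range(n_epochs):
        train_step(model, train_set)      # update theta using the training set
        score = validate(model, val_set)  # evaluate theta on the validation set
        if score > best_score:
            best_score, best_model, stalls = score, dict(model), 0
        else:
            stalls += 1
            if stalls >= patience:        # early stopping: theta now depends
                break                     # on x_V through this decision
    return best_model

def train_step(model, train_set):
    # Hypothetical update: move the weight toward the training-set mean.
    target = sum(train_set) / len(train_set)
    model["weight"] += 0.5 * (target - model["weight"])

def validate(model, val_set):
    # Hypothetical score: negative squared error against the validation mean.
    target = sum(val_set) / len(val_set)
    return -(model["weight"] - target) ** 2

model = fit([1.0, 2.0, 3.0], [2.0, 4.0])
```

Because both the early-stopping decision and the choice of which snapshot to return consult the validation score, the output distribution is $p(\theta \mid x_P, x_V)$ rather than $p(\theta \mid x_P)$.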
4.3 Modified Predictive Information Bottleneck Objective with Inequality Constraint and Validation Set
From the Markov property of the information processing chain $x_F \leftrightarrow (x_P, x_V) \leftrightarrow \theta$, we have $I(\theta; x_F \mid x_P, x_V) = 0$, so we have
4.4 Variational Approximations
Treat the distribution $p(x_F \mid \theta)$ as a "likelihood" function and use a variational approximation for it given by $q(x_F \mid \theta)$, for example the factorized form $q(x_F \mid \theta) = \prod_i q\big(x_F^{(i)} \mid \theta\big)$ for some selected distribution $q(\cdot \mid \theta)$. Then we obtain (7):
Treat the conditional distribution $p(x_V \mid \theta, x_P)$ as a conditional likelihood function and use a variational approximation for it given by $q(x_V \mid \theta, x_P)$, for example the factorized form $q(x_V \mid \theta, x_P) = \prod_i q\big(x_V^{(i)} \mid \theta, x_P\big)$ for some selected distribution $q(\cdot \mid \theta, x_P)$. Then we obtain
where for any two random variables $X$ and $Y$, $H(Y \mid X)$ is the conditional entropy:
$$
H(Y \mid X) = -\mathbb{E}_{p(x, y)}\big[ \log p(y \mid x) \big].
$$
Note that the process of training on $x_P$ along with periodic validation using the validation set $x_V$ induces a dependence between $x_P$ and $x_V$ through the model $\theta$, hence it is not true in general that $p(x_V \mid \theta, x_P) = p(x_V \mid \theta)$, even if $x_P$ and $x_V$ are assumed independent.
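A minimal numerical illustration of this point (the XOR "model" below is a deliberately simple stand-in for $\theta$, not part of the formulation): two marginally independent bits become dependent once we condition on a statistic computed from both of them.

```python
import itertools

# x_P, x_V: independent fair bits; theta is deterministic given both,
# e.g. theta = x_P XOR x_V
joint = {}
for xp, xv in itertools.product((0, 1), repeat=2):
    theta = xp ^ xv
    joint[(xp, xv, theta)] = 0.25

# Condition on the event theta = 0
p_theta0 = sum(v for (xp, xv, t), v in joint.items() if t == 0)
cond = {(xp, xv): v / p_theta0
        for (xp, xv, t), v in joint.items() if t == 0}

p_xp0 = sum(v for (xp, _), v in cond.items() if xp == 0)   # marginal of x_P | theta=0
p_xv0 = sum(v for (_, xv), v in cond.items() if xv == 0)   # marginal of x_V | theta=0
# Given theta = 0: p(x_P=0, x_V=0) = 0.5, but p(x_P=0) p(x_V=0) = 0.25,
# so x_P and x_V are dependent conditional on theta.
```

This is the standard "explaining away" effect: conditioning on a common effect couples otherwise independent causes.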
4.5 Variational Formulation of the Predictive Information Bottleneck Objective with Inequality Constraint and a Validation Set
where in the final step we use the identity that for any two random variables $X$ and $Y$,
$$
I(X; Y) = H(Y) - H(Y \mid X).
$$
We treat the last three terms on the right-hand side of (27) as constants outside our control. Since the exact problem (22) cannot be solved directly, we simply select the Lagrange multipliers and the variational distributions $q(\theta)$, $q(x_F \mid \theta)$, and $q(x_V \mid \theta, x_P)$ based on some external criteria and solve the following problem, whose optimum value yields an upper bound on the exact objective in (22):
and we emphasize that the expectation in (28) is over the true distributions of $\theta$, $(x_P, x_V)$, and $x_F$. Note that the use of the variational approximation $q(x_F \mid \theta)$ has eliminated the dependence on the true likelihood $p(x_F \mid \theta)$.
It follows from (29) that the optimum distribution $p^*(\theta \mid x_P, x_V)$ is given by
[1] A. Zellner, "Optimal Information Processing and Bayes's Theorem," The American Statistician, vol. 42, no. 4, pp. 278–280, Nov. 1988.
[2] A. A. Alemi, "Variational Predictive Information Bottleneck," arXiv:1910.10831, 2019. https://arxiv.org/abs/1910.10831