The aim of the current paper is to extend description length methodology (which we use as a broader term than MDL) to model selection for practical machine learning methods. We consider a supervised learning problem with features and labels . Given a set of training data we want to find a predictor of . Here
is a set of parameters that are estimated from the training data, and is a set of hyperparameters that are chosen; these are typically the model order, e.g., the number of hidden units and layers in neural networks, but also quantities like regularization parameters and early stopping times (Bishop, 2006). The goal is to minimize the test error, or generalization error,
for some loss function. However, only the empirical loss (risk) is available:
Using MDL for model selection in learning has been considered before, e.g., Grünwald (2011); Watanabe (2013); Watanabe & Roos (2015); Kawakita & Takeuchi (2016); Alabdulmohsin (2018). In this paper we directly relate a type of description length and generalization error (Theorem 1), using the principles of Fogel & Feder (2018)
, and this provides a strong rationale for using description length for learning. We then use this theory to develop practical methods for model selection, in particular in neural networks and deep learning.
We consider a supervised learning problem with features and labels . The data
is governed by a probability law, where does not depend on .
We are given a training set which we assume is iid from the distribution . We use the notation
to denote the whole training set. The problem we consider is, based on the training data, to estimate the probability distribution so as to minimize the log-loss or cross-entropy
The expectation here is over both test data and the training set, , with respect to the distribution , for a fixed . We will discuss other loss functions later.
2.1 Universal Source Coding and Learning
In this section we assume that the data is from a finite alphabet. Based on the training data we want to find a good estimated probability law , which need not be of the type , and consider as in (3) the log-loss
Importantly, we can interpret as a codelength as follows. By a codelength we mean the number of bits required to represent the data without loss (as when zipping a file). First the encoder is given the training data from which it forms ; this is shared with the decoder. Notice that this sharing is done ahead of time, and does not contribute to the codelength. Next, the encoder is given new data . The decoder knows but not . The encoder encodes using (using an arithmetic coder (Cover & Thomas, 2006)), and the decoder, knowing , should be able to decode without loss. The codelength averaged over all training and test data is then within a few bits (Cover & Thomas, 2006). Since this is based on training data, we call this the learned codelength.
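To make the learned-codelength interpretation concrete, here is a minimal sketch under our own assumptions (the counting estimator with Laplace smoothing and all function names are illustrative, not from the paper): a conditional pmf is estimated from training pairs, and a new label then costs about -log2 p̂(y|x) bits under an arithmetic coder.

```python
import math
from collections import Counter, defaultdict

def fit_conditional_pmf(pairs, alphabet, alpha=1.0):
    """Estimate p(y|x) from (x, y) training pairs by counting, with
    Laplace smoothing `alpha` (an illustrative choice, not from the paper)."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    def pmf(y, x):
        n = sum(counts[x].values())
        return (counts[x][y] + alpha) / (n + alpha * len(alphabet))
    return pmf

def codelength_bits(pmf, y, x):
    """An arithmetic coder driven by pmf spends about -log2 p(y|x) bits."""
    return -math.log2(pmf(y, x))

train_pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
pmf = fit_conditional_pmf(train_pairs, alphabet=[0, 1])
```

Likelier labels under the learned pmf receive shorter codewords, which is exactly why average codelength and log-loss coincide.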
The goal is to minimize the codelength , equivalently the log-loss. The codelength depends on the true probability distribution, i.e., , but we would like an estimator that is reasonably good for all . In order to do so we first define the regret (or redundancy) of an estimator as
where is the conditional entropy and is the relative entropy or Kullback–Leibler distance (Cover & Thomas, 2006). The regret is the difference in codelength between the trained coder and an omniscient coder/decoder knowing the actual value of . Minimizing codelength is equivalent to minimizing regret. A reasonable goal is to minimize the worst-case regret over all , and we therefore define the optimum estimator as
where is some set, and correspondingly the minimax regret
We now define
This can be considered the average codelength when the distribution of data is given by and the coding distribution is used. This is similar to the problem setup considered in Fogel & Feder (2018).
A related problem to the above is universal coding of the training data itself. In this case we assume the decoder knows the features in the training data but not the corresponding labels ; the task is to communicate these to the decoder. Again, we want to find a good estimated probability law and consider the universal codelength
The expectation here is over the training data only. Notice that in this case, as opposed to learned codelength, the decoder does not know which the encoder is using, and some bits are therefore needed to encode this information, either explicitly or implicitly. This is a key difference from learned coding; see Fig. 1.
Similarly to (5) we define the regret
(by the iid assumption ) and
Again, we can interpret as the average codelength to encode when the distribution of data is given by and the coding distribution from (7) is used. This is a problem that has a long history in universal source coding, e.g., Shamir (2006). A difference from traditional universal source coding is that there are features known to both encoder and decoder, but this does not change the problem fundamentally. Therefore, many results from universal source coding can be used. Given a specific training sequence we can use a universal source coding algorithm (Shamir, 2006) to encode , denoting the resulting codelength (which depends on the specific ). For a good source coder we have
Thus, although might not be known, we can find good estimates.
The main result of the paper is the following theorem relating learned coding and universal source coding:
Assume that the set of distributions is compact and convex. Then the learned codelength is bounded by
An alternative way to think of the problem is that there is an unknown prior over . We can then define
It is clear that . We can write the expectation explicitly as
This is concave (linear) in and convex in (Cover & Thomas, 2006), and the optimization is over compact sets. Therefore, by the minimax theorem (Von Neumann, 2007), . Similarly, for universal source coding , which is a well-known fact (Shamir, 2006); the fact that there are features in addition to labels does not make an essential difference.
To find the universal codelength we need to find an optimum coding distribution for a given prior . This is the same as the optimum Bayes estimator (Scharf, 1990), which is given by
For learned coding, the expectation is over the training and the single test data ; we can append the to and denote this by . The optimum coding distribution for the learned codelength is given by Bayes rule
where the joint probability distributions are
Then for a given prior and true parameter we have
For the first term here we consider as combined training data . Now
The importance of the theorem is that it allows us to find learned codelength (i.e., generalization error) in terms of universal codelength. To find generalization error directly, one would need additional data, whereas the universal codelength is a property of the training data itself.
2.2 Use for model selection
Consider selection between two models and with probability laws and . A model selection rule is a decision rule . For a specific decision rule, let be the probability of choosing model when model is true, which in general depends on as well as the distribution of the features . Suppose that the data is generated by model and that chooses the correct model. The generalization error for log-loss is then given by (4) as and the regret by (5) as . If chooses the wrong model, the generalization error is and the regret . We define the conditional regret of as
That is, is the regret of when the model is . As in Section 2.1 we consider minimax regret, and an optimum decision rule would minimize the maximum regret, , i.e., minimax hypothesis testing (Scharf, 1990). Finding the optimum decision rule is in general impossible, so our goal is to find good practical decision rules. However, one can notice that usually optimum minimax detectors have (Scharf, 1990), so a good decision rule should have .
Specifically, we will develop decision rules based on the theory in Section 2.1. A reasonable decision rule is to choose the model that has the smallest predicted generalization error. There is of course no guarantee it will lead to optimum minimax performance. Since the generalization error can be bounded by the differential description length by Theorem 1, one could use the rule
where is the length of the output of a universal source coder for model .
We call this the differential description length (DDL).
The issue with using (9) directly is that is a noisy estimate of , and that , as a difference of two noisy estimates, is therefore quite inaccurate. We instead suggest using, for some integer ,
We can write , an average of the differential over the last samples. In experiments we have seen that the performance is quite insensitive to .
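The difference-of-codelengths computation can be sketched as follows; the Krichevsky–Trofimov sequential estimator is our own choice of universal coder for this binary illustration (the paper's rule applies to any universal source coder). Because the coder is sequential, the bits spent on the last n−m symbols are exactly the difference of two total codelengths.

```python
import math

def kt_sequential_codelength(bits):
    """Total bits for a binary sequence under the Krichevsky-Trofimov
    sequential estimator, a standard universal source coder."""
    n0 = n1 = 0
    total = 0.0
    for b in bits:
        p1 = (n1 + 0.5) / (n0 + n1 + 1.0)   # sequential probability of a 1
        total += -math.log2(p1 if b == 1 else 1.0 - p1)
        n1 += b
        n0 += 1 - b
    return total

def ddl(bits, m):
    """Differential description length: bits the coder spends on the
    last n-m symbols, i.e., L(y_1..y_n) - L(y_1..y_m)."""
    return kt_sequential_codelength(bits) - kt_sequential_codelength(bits[:m])

seq = [0, 1, 0, 0, 1, 0, 0, 0]
d = ddl(seq, m=len(seq) // 2)
```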
We will illustrate how the above rule works on a simple example. The data
is (iid) binary given by a conditional probability distribution (and marginal ). Under model , is independent of , while under it is dependent on . For model there is a single unknown parameter , while for there are two unknown parameters . From Shamir (2006) we can conclude that the description length is
except for some small terms. Here is the entropy calculated for the empirical distribution of the training data.
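A sketch of these two description lengths, under the standard (k/2) log n redundancy per parameter from Shamir (2006) (the exact small terms are omitted, and the helper names are ours): model 1 pays the empirical marginal entropy plus one half-log term, model 2 the empirical conditional entropy plus two.

```python
import math
from collections import Counter

def empirical_entropy(ys):
    """Entropy (bits per symbol) of the empirical distribution of ys."""
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def desc_length_model1(ys):
    """One free parameter p(y=1): fit cost n*H(y) plus (1/2) log2 n."""
    n = len(ys)
    return n * empirical_entropy(ys) + 0.5 * math.log2(n)

def desc_length_model2(xs, ys):
    """Two parameters p(y=1|x=0), p(y=1|x=1): n*H(y|x) plus 2*(1/2) log2 n."""
    n = len(ys)
    cond = 0.0
    for xv in (0, 1):
        sub = [y for x, y in zip(xs, ys) if x == xv]
        if sub:
            cond += len(sub) / n * empirical_entropy(sub)
    return n * cond + math.log2(n)
```

On strongly dependent data the conditional entropy collapses and model 2 wins despite its extra parameter cost.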
Theoretical analysis of even this simple model is very complex, even though we have closed-form expressions (11) for codelength, as it is not easy to calculate (8). We will therefore limit ourselves to a numerical maximization over the parameters in (8).
Fig. 2 shows the regret as a function of for both full description length (i.e., essentially ) and differential description length when is optimized. As mentioned above, a good decision rule should have . For both methods it appears that , where is independent of , which is a reasonable expression of . Most importantly, we see that the minimax regret is smaller for DDL than for full description length.
The remaining issue is how to choose . Fig. 3 shows the regret as a function of . The main conclusion is that unless we choose very small or very large, the exact value has little effect. It seems that a good simple choice could be .
3 Hyperparameter Selection in Machine Learning
In order to use description length for machine learning, the machine learning methods need to be able to code data. For example, for discrete labels the softmax output used in neural networks can be interpreted as a probability, which can be input to an arithmetic coder (Cover & Thomas, 2006), and the negative logarithm of the probability can therefore be interpreted as a codelength, within a few bits.
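This softmax-to-codelength reading can be sketched in a few lines (the function names are ours):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def label_codelength_bits(logits, label):
    """Treat the softmax output as a coding distribution: an arithmetic coder
    spends about -log2 p(label) bits, i.e., the log-loss measured in bits."""
    return -math.log2(softmax(logits)[label])
```

The network's cross-entropy training loss, summed over samples and converted to base 2, is thus directly a codelength.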
If the alphabet for are the reals, encoding (exactly) requires an infinite number of bits. We can still argue that (4) and (6) are actual codelengths when we use a pdf for as follows. We will assume a fixed point representation of the reals with a (large) finite number, , bits after the period, and an unlimited number of bits prior to the period as in Rissanen (1983). Assume that the data is distributed according to a pdf . Then the number of bits required to represent is given by
When we use description length, we are only interested in comparing codelengths, so the dependency on cancels out.
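The cancellation argument can be illustrated numerically; the Gaussian density and the specific precision values here are our own example, not from the paper. Quantizing to a cell of width 2^-b turns the density into a probability mass of about f(y)·2^-b, so the codelength is -log2 f(y) + b, and b drops out of any comparison.

```python
import math

def quantized_codelength_bits(pdf_value, precision_bits):
    """Bits to encode a real y quantized to 2**-b resolution when the model
    density at y is pdf_value: -log2(pdf_value * 2**-b) = -log2 pdf + b."""
    return -math.log2(pdf_value) + precision_bits

def gaussian_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

b = 32                                   # fixed-point fractional bits
l_good = quantized_codelength_bits(gaussian_pdf(0.1, 0.0, 1.0), b)
l_bad = quantized_codelength_bits(gaussian_pdf(0.1, 5.0, 1.0), b)
diff = l_good - l_bad                    # b cancels in the comparison
```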
With the above, in general we can write the codelength to encode training data as
where denotes the hyperparameters. This is a codelength, but it requires the decoder to know (both encoder and decoder are assumed to know ). MDL (Rissanen, 1978, 1983, 1986; Grunwald, 2007) therefore additionally encodes ; since encoding the exact value requires an infinite number of bits, instead an approximation is encoded, and the total codelength is minimized
where is the number of bits required to encode , which can be either explicit or implicit. The minimizing is usually close to the maximum likelihood solution , and can be taken as expressing how well the model fits the given observation.
Now, in machine learning the aim is to minimize generalization error (1), not as such fitting a model. Rather than using MDL directly, we can use it through Theorem 1 by relating generalization error and universal codelength. In this context, in (14) can indeed be thought of as a universal codelength given a certain model class in the sense of Section 2.1, and it can therefore be used in Theorem 1 to estimate generalization error.
In this rule, is the codelength of a universal source coder of the sequence . Many universal source coders and MDL methods are sequential, for example Lempel-Ziv (Ziv & Lempel, 1977, 1978), CTW (Willems et al., 1995), and predictive MDL (Rissanen, 1986). For such methods, the decoder decodes , , and uses that information to decode repeatedly. Therefore the number of bits used to encode is the same whether is encoded by itself or as the beginning of a longer sequence . As a consequence, for such methods , where is the codelength of encoding when the encoder and decoder are given the side information of . The expression has the advantage that it is less noisy than , mainly because each term in the difference has its own uncertainty. We will therefore use
This methodology has further advantages, which will be discussed shortly.
The above methodology is specifically aimed at minimizing generalization error in terms of log-loss. This is useful as the log-loss in some sense dominates all other loss functions (Painsky & Wornell, 2018) – so, if we minimize the log-loss, other losses will also be kept small. In our experiments we have observed that when we minimize log-loss, in general other loss functions will also be reduced, see for example Fig. 7 later.
When applying the above to machine learning methods, there are several complications. One is that practical machine learning algorithms usually do not/cannot find the globally optimum solution, and the minimization in (14) therefore does not make sense. Rather, for a given set of hyperparameters we get a (suboptimum) solution (that is, is some particular output of a machine learning algorithm, not necessarily the solution of an optimization problem). To resolve this, we think of a set of hyperparameters giving a solution region rather than just a single solution and replace the solution (14) with
For example, might be a local minimum and the neighborhood where is the minimum. Thus, is still a solution to an optimization problem, we can think of (17) as a universal source coder, and as long as is convex (we can for example take it as an -ball around ), Theorem 1 applies.
One common set of hyperparameters is regularization parameters. Many regularization functions can be thought of giving a prior on , ; for example, regularization can be thought of as giving a Gaussian prior on . With this, we can write (17) as
essentially a MAP (maximum a posteriori) solution instead of a maximum likelihood solution, and Theorem 1 therefore still applies.
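The regularization-as-prior correspondence can be verified directly: for a 1-D linear model (our own toy setup) the L2-penalized loss with lambda = sigma²/tau² equals 2·sigma² times the negative log posterior under a zero-mean Gaussian prior, up to a constant independent of the parameter.

```python
import math

def neg_log_posterior(w, xs, ys, sigma2, tau2):
    """-log p(y|x,w) - log p(w) for a 1-D linear model with Gaussian noise
    of variance sigma2 and a zero-mean Gaussian prior on w of variance tau2."""
    sq = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    nll = sq / (2 * sigma2) + 0.5 * len(ys) * math.log(2 * math.pi * sigma2)
    nlp = w ** 2 / (2 * tau2) + 0.5 * math.log(2 * math.pi * tau2)
    return nll + nlp

def regularized_loss(w, xs, ys, lam):
    """Squared loss with L2 penalty; lam = sigma2 / tau2 matches the prior."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + lam * w ** 2
```

Since the two objectives differ only by a w-independent constant, they share the same minimizer: the MAP estimate.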
We focus on MDL methods that are based on actual encoding of data rather than simple approximations of codelength. We do not believe that simple approximate formulas can always capture the complexity of complex learning algorithms. As an example, the impact of regularization parameters is not characterized by simply counting the dimension of the parameter space. There are a number of such methods, for example normalized maximum likelihood (NML) (Shtar’kov, 1987), sequential NML (Roos & Rissanen, 2008), and sufficient statistics (Sabeti & Host-Madsen, 2017). In the current paper we limit ourselves to Rissanen’s predictive MDL (Rissanen, 1986), which calculates a codelength
where is the maximum likelihood estimate. Predictive MDL has the advantage that it is straightforward to implement once one has a maximum likelihood solution. But an issue with predictive MDL is initialization: is clearly not defined for , and likely should be large for the estimate to be good. When the initial estimate is poor, it can lead to very long codelengths, see Sabeti & Host-Madsen (2017). Fortunately, DDL completely overcomes this problem when is used. Therefore, predictive MDL is a promising method for machine learning.
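A Bernoulli sketch of predictive MDL and its initialization problem (the probability floor is our own ad-hoc device to keep the demo finite; in practice the early terms are simply unreliable): each symbol is coded with the ML estimate computed from the preceding symbols, and starting the sum later, DDL-style, discards exactly the badly initialized terms.

```python
import math

def predictive_mdl_bits(ys, start):
    """Predictive-MDL codelength of ys[start:]: each symbol is coded with the
    ML Bernoulli estimate computed from all preceding symbols. A small
    probability floor keeps early zero-probability estimates finite
    (an ad-hoc device for this demo)."""
    total = 0.0
    for t in range(start, len(ys)):
        theta = sum(ys[:t]) / t                  # ML estimate from the prefix
        p = theta if ys[t] == 1 else 1.0 - theta
        total += -math.log2(max(p, 1e-12))
    return total

seq = [1, 0, 1, 1, 0, 1, 1, 1]
full_bits = predictive_mdl_bits(seq, start=1)             # includes bad early terms
ddl_bits = predictive_mdl_bits(seq, start=len(seq) // 2)  # DDL skips them
```

Here the first coded symbol gets probability zero under the ML estimate from a single sample, blowing up the full codelength, while the DDL-style tail sum stays well behaved.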
4 Linear Regression
We will first show how the methodology can be applied to a simple machine learning method, linear regression.
Let , where
are the feature vectors. Assuming a Gaussian model with variance, the ML estimate with regularization is (e.g., Bishop (2006); Scharf (1990))
The estimate (20) is not defined until is at least equal to the dimension of the feature space. But even then, the estimate is not reliable, and using it directly for MDL can give a codelength that is nearly infinite, which makes predictive MDL not very useful. However, with DDL implemented through (16) we only need to calculate (20) for , which makes it much more reliable. There are recursive algorithms for updating (Haykin, 2002), so predictive MDL/DDL can be implemented very efficiently.
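One standard recursive scheme of the kind Haykin (2002) describes is recursive least squares via the Sherman–Morrison rank-one update; the sketch below (our own minimal version, returning per-sample squared prediction errors rather than codelengths) maintains the inverse regularized Gram matrix so each new sample costs O(d²) instead of a fresh O(d³) solve.

```python
import numpy as np

def rls_predictive_sqerrs(X, y, lam):
    """Predict each y_t from a ridge fit on samples 1..t-1, maintaining
    P = (lam*I + X_t' X_t)^{-1} with the Sherman-Morrison rank-one update
    so the whole sweep costs O(n d^2) instead of O(n d^3)."""
    d = X.shape[1]
    P = np.eye(d) / lam          # inverse of lam*I before any data
    b = np.zeros(d)              # running X'y
    errs = []
    for x_t, y_t in zip(X, y):
        w = P @ b                # current ridge estimate from the prefix
        errs.append(float((y_t - x_t @ w) ** 2))
        Px = P @ x_t
        P -= np.outer(Px, Px) / (1.0 + x_t @ Px)  # Sherman-Morrison
        b += y_t * x_t
    return errs, P @ b           # per-sample errors and final estimate
```

Plugging the per-sample prediction errors into a Gaussian codelength then gives the predictive MDL/DDL sum.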
Figure 4 shows some experimental results. The setup is that of fitting polynomials of order up to 20 to the curve . We generate 500 random and observe , where . We seek to optimize the regularization parameter in regularization. We use DDL with and compare with cross-validation, where we use 25% of samples for cross-validation. We also compare with Bayes model selection, using the theory in Bishop (2006, Section 3.5.1) to optimize . We plot the difference from the minimum generalization error when is chosen to minimize the actual generalization error (calculated over 50,000 samples), the excess generalization error. One can see that DDL essentially chooses the correct in nearly 50% of cases, and is always better than cross-validation (the reason cross-validation can have negative excess generalization error is that is calculated from only 75% of the samples, which has a chance of being better than an estimate calculated from the full set of training samples). It is also clearly better than the Bayes method (which, in its defense, was not developed specifically to minimize generalization error). The curves for MSE rather than log-loss are nearly identical, so we have not included them.
In Fig. 5 we modify the experiment to directly vary the model order without regularization. In that case we can also compare with traditional MDL (Rissanen, 1983) through the simple approximation of (14) by
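The traditional two-part approximation is the familiar fit-plus-(k/2) log n score; the sketch below applies it to hypothetical fit costs (the numbers are invented for illustration, not from the experiment).

```python
import math

def two_part_mdl_bits(fit_bits, k, n):
    """Classic two-part MDL score: data-fit cost in bits plus
    (k/2) log2 n bits to describe the k parameters."""
    return fit_bits + 0.5 * k * math.log2(n)

# Hypothetical fit costs (bits) for model orders 1..3 on n = 100 samples.
fit_bits = {1: 120.0, 2: 95.0, 3: 94.5}
best_order = min(fit_bits, key=lambda k: two_part_mdl_bits(fit_bits[k], k, 100))
```

Order 3 fits slightly better than order 2, but its extra half-log-n parameter cost outweighs the gain, so the score picks order 2.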
We see that DDL is again better than cross-validation, and better than traditional MDL, except that DDL and cross-validation have heavy tails.
5 Neural Networks
MDL theory is based on maximum likelihood solutions, i.e., (14). On the other hand, training of a neural network is unlikely to converge to the maximum likelihood solution. Rather, the error function has many local minima, and training generally iterates to some local minimum (or a point near a local minimum), and which one can depend on the initialization, i.e., a solution of the type in (18). This requires adaptation of methods like predictive MDL (19). Another challenge is complexity. For example, directly using predictive MDL (19) requires training for every subset of samples for , which is not computationally feasible. Applying description length to neural networks in a meaningful and practical way is therefore highly non-trivial.
There are many methods for training neural networks. Our aim is not to develop new training methods, but rather to use description length for hyperparameter optimization with any training method. We would therefore like to find the description length for a neural network with a specific solution for the weights , fairly agnostically to how that solution was found. As mentioned, in the current paper we will limit ourselves to predictive MDL. In order to use predictive MDL, we need in step of (19). For regression we need covariance estimates as well. Now to calculate we could clearly use as initialization, so that we do not need to do a full training at every stage. Therefore it could be quite computationally feasible to calculate the whole sequence , . However, this raises several issues. One is that the sequence , might not converge to ; by starting training on a small amount of data we might get an inferior solution, stuck in an undesirable local minimum; and as mentioned, our goal was to not change how we train the neural network, but check the outcome of a given training algorithm. Another issue is that updating with new data is not a solved problem in neural network training.
The solution we propose is one we might call reverse unlearning. We start with , which is obtained with any standard training algorithm. To find the solution we use as initialization for training on ; essentially we unlearn . The idea is that the solution is at or near a local minimum of the cost function for , as in (18), and that the cost function for has a nearby local minimum. By retraining on the initial solution moves to that local minimum–rather than jumping to another local minimum. We continue like that until we get a solution . The idea is that the sequence of solutions , stays close to the original solution , and the resulting description length therefore is a property of the specific solution . The method is described in Alg. 1.
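The reverse-unlearning schedule can be sketched as follows. The gradient-descent trainer and the toy linear model are stand-ins of our own choosing (Alg. 1 is agnostic to the trainer); the point is the warm-started sweep from the full data set down through shorter prefixes.

```python
import numpy as np

def train(X, y, w_init, lr=0.1, steps=500):
    """Plain gradient descent on squared loss; stands in for any trainer."""
    w = w_init.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def reverse_unlearning_solutions(X, y, checkpoints):
    """Fit on all n samples first, then re-fit on successively shorter
    prefixes, warm-starting each fit from the previous solution so the
    solution path stays near the original local minimum."""
    w = train(X, y, np.zeros(X.shape[1]))
    sols = {len(y): w}
    for m in sorted(checkpoints, reverse=True):
        w = train(X[:m], y[:m], w)   # warm start = "unlearn" the tail
        sols[m] = w
    return sols
```

The per-checkpoint solutions can then be plugged into the predictive-MDL sum, so the resulting codelength is tied to the specific full-data solution rather than to whatever minima fresh small-data training would find.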
Figures 6–7 show some experimental results. The setup is the same as for linear regression in Section 4. We train a single-layer neural network with 15 hidden nodes on 75 random with observations , where . We seek to optimize the regularization parameter in regularization. We use DDL with and compare with cross-validation, where we use 25% of samples for cross-validation. Differential description length is aimed at minimizing log-loss, and Fig. 6 shows that it indeed performs better than cross-validation for log-loss. Fig. 7 shows the same results for mean square error (MSE). In this case, cross-validation is set to minimize MSE, whereas DDL of course still aims at minimizing log-loss. Still, DDL outperforms cross-validation.
This paper has developed the framework for DDL. There is still much work to do to make this into a practical method. First, there are other methods for implementing DDL than predictive MDL. One is direct quantization of parameters, i.e., a literal implementation of (17). For reverse unlearning there are many theoretical and practical open problems that go to the depth of how neural networks learn. We also need to conduct experiments on larger, more realistic learning problems. Finally, the paper and methodology only develop a gauge for hyperparameter selection. To use this for large-scale problems with many hyperparameters, optimization algorithms are needed, such as those in Li et al. (2017), where it might be possible to use our gauge as an input to the optimization.
- Alabdulmohsin (2018) Alabdulmohsin, I. Information theoretic guarantees for empirical risk minimization with applications to model selection and large-scale optimization. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 149–158, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/alabdulmohsin18a.html.
- Bishop (2006) Bishop, C. M. Pattern recognition and machine learning. springer, 2006.
- Cover & Thomas (2006) Cover, T. and Thomas, J. Information Theory, 2nd Edition. John Wiley, 2006.
- Fogel & Feder (2018) Fogel, Y. and Feder, M. Universal batch learning with log-loss. In IEEE International Symposium on Information Theory: ISIT’18 (Vail, Colorado), 2018.
- Grünwald (2011) Grünwald, P. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 397–420, 2011.
- Grunwald (2007) Grunwald, P. D. The Minimum Description Length Principle. MIT Press, 2007.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd Edition. Springer, 2009.
- Haykin (2002) Haykin, S. Adaptive Filter Theory, 4th Edition. Pearson, 2002.
- Kawakita & Takeuchi (2016) Kawakita, M. and Takeuchi, J. Barron and cover’s theory in supervised learning and its application to lasso. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1958–1966, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/kawakita16.html.
- Li et al. (2017) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
- Painsky & Wornell (2018) Painsky, A. and Wornell, G. On the universality of the logistic loss function. In IEEE International Symposium on Information Theory: ISIT’18 (Vail, Colorado), 2018.
- Rissanen (1978) Rissanen, J. Modeling by shortest data description. Automatica, pp. 465–471, 1978.
- Rissanen (1983) Rissanen, J. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, (2):416–431, 1983.
- Rissanen (1986) Rissanen, J. Stochastic complexity and modeling. The Annals of Statistics, (3):1080–1100, Sep. 1986.
- Roos & Rissanen (2008) Roos, T. and Rissanen, J. On sequentially normalized maximum likelihood models. In Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-08), 2008.
- Sabeti & Host-Madsen (2017) Sabeti, E. and Host-Madsen, A. Enhanced mdl with application to atypicality. In 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017.
- Scharf (1990) Scharf, L. L. Statistical Signal Processing: Detection, Estimation, and Time Series Analysis. Addison-Wesley, 1990.
- Shamir (2006) Shamir, G. On the mdl principle for i.i.d. sources with large alphabets. Information Theory, IEEE Transactions on, 52(5):1939–1955, May 2006. ISSN 0018-9448. doi: 10.1109/TIT.2006.872846.
- Shtar’kov (1987) Shtar’kov, Y. M. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3–17, 1987.
- Von Neumann (2007) Von Neumann, J. Theory of games and economic behavior. Princeton Classic Editions, 60th anniversary edition. Princeton University Press, Princeton, NJ, 2007.
- Watanabe & Roos (2015) Watanabe, K. and Roos, T. Achievability of asymptotic minimax regret by horizon-dependent and horizon-independent strategies. The Journal of Machine Learning Research, 16(1):2357–2375, 2015.
- Watanabe (2013) Watanabe, S. A widely applicable bayesian information criterion. Journal of Machine Learning Research, 14(Mar):867–897, 2013.
- Willems et al. (1995) Willems, F. M. J., Shtarkov, Y., and Tjalkens, T. The context-tree weighting method: basic properties. Information Theory, IEEE Transactions on, 41(3):653–664, 1995. ISSN 0018-9448. doi: 10.1109/18.382012.
- Ziv & Lempel (1977) Ziv, J. and Lempel, A. A universal algorithm for sequential data compression. Information Theory, IEEE Transactions on, 23(3):337–343, May 1977. ISSN 0018-9448. doi: 10.1109/TIT.1977.1055714.
- Ziv & Lempel (1978) Ziv, J. and Lempel, A. Compression of individual sequences via variable-rate coding. Information Theory, IEEE Transactions on, 24(5):530–536, Sep. 1978. ISSN 0018-9448. doi: 10.1109/TIT.1978.1055934.