We are concerned with the problem of learning with redundant models (or hypothesis classes). This setting is not uncommon in real-world machine learning and data mining problems because the amount of available data is often limited owing to the cost of data collection. In contrast, one can come up with an unbounded number of models that explain the data. For example, in sparse regression, one may consider a number of features that is much larger than the number of data points, assuming that useful features are actually scarce Rish and Grabarnik (2014). Another example is statistical conditional-dependency estimation, in which the number of parameters to estimate is quadratic in the number of random variables, while the number of nonzero coefficients is often expected to be sub-quadratic.
In the context of such a redundant model, there is a danger of overfitting, a situation in which the model fits the present data excessively well but does not generalize well. To address this, we introduce regularization and reduce the complexity of the models by taking the regularized empirical risk minimization (RERM) approach Shalev-Shwartz and Ben-David (2014). In RERM, we minimize the sum of the loss and penalty functions to estimate parameters. However, the choice of the penalty function should be made cautiously as it controls the bias-variance trade-off of the estimates, and hence has a considerable effect on the generalization capability.
In conventional methods for selecting such hyperparameters, a two-step approach is usually followed. First, a candidate set of penalty functions is configured (possibly randomly). Then, a penalty selection criterion is computed for each candidate and the best one is chosen. Note that this method can be applied to any penalty selection criterion. Sophisticated approaches like Bayesian optimization Mockus et al (2013) and gradient-based methods Larsen et al (1996) also tend to leave the criterion as a black box. Although leaving it as a black box is advantageous in that it works for a wide range of penalty selection criteria, a drawback is that the full information of each specific criterion cannot be utilized. Hence, the computational costs can be unnecessarily large if the design space of the penalty function is high-dimensional.
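The two-step procedure described above can be sketched as follows. This is a minimal illustration of the conventional black-box approach, not the method proposed in this paper. It assumes a ridge regression model, a hold-out squared error as the selection criterion, and a log-spaced grid of scalar penalty weights; the names `ridge_fit`, `holdout_error`, and `select_penalty` are hypothetical.

```python
import numpy as np

def ridge_fit(A, y, lam):
    """Closed-form ridge estimate: argmin 0.5*||y - A w||^2 + 0.5*lam*||w||^2."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)

def holdout_error(w, A_val, y_val):
    """Black-box criterion (an assumption here): mean squared error on held-out data."""
    return float(np.mean((y_val - A_val @ w) ** 2))

def select_penalty(A_tr, y_tr, A_val, y_val, candidates):
    """Step 1 is the candidate list; step 2 scores every candidate and keeps the best."""
    scores = [holdout_error(ridge_fit(A_tr, y_tr, lam), A_val, y_val)
              for lam in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores

# Synthetic data (illustrative only): few useful features among many.
rng = np.random.default_rng(0)
n, p = 40, 30
A = rng.normal(size=(2 * n, p))
w_true = np.zeros(p)
w_true[:3] = 1.0
y = A @ w_true + rng.normal(size=2 * n)

candidates = list(np.logspace(-3, 3, 13))
best_lam, scores = select_penalty(A[:n], y[:n], A[n:], y[n:], candidates)
```

The loop never looks inside the criterion, which is what makes it generic. With d separate penalty weights and k grid values per weight, the same loop would require k**d model fits, which illustrates why the cost grows quickly with the dimensionality of the penalty design space.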
In this paper, we propose a novel penalty selection method that utilizes information about the objective criterion efficiently on the basis of the minimum description length (MDL) principle Rissanen (1978). We especially focus on the luckiness normalized maximum likelihood (LNML) code length Grünwald (2007) because the LNML code length measures the complexity of regularized models without making any assumptions on the form of the penalty functions. Moreover, it places a tight bound on the generalization error Grünwald and Mehta (2017). However, the actual use of LNML on large models has been limited so far. This is owing to the following two issues.
(I1) LNML contains a normalizing constant that is hard to compute, especially for large models. This tends to make the evaluation of the code length intractable.
(I2) Since the normalizing term is defined as a non-closed form of the penalty function, efficient optimization of LNML is non-trivial.
We address these issues as follows. First, we derive a tight uniform upper bound of the LNML code length, namely uLNML. The key idea is that the normalizing constant of LNML, which is not analytic in general, is characterized by the smoothness of loss functions, which can often be upper-bounded by an analytic quantity. As such, uLNML exploits the smoothness information of the loss and penalty functions to approximate LNML with much smaller computational costs, which solves issue (I1). Moreover, within the framework of the concave-convex procedure (CCCP) Yuille and Rangarajan (2003), we propose an efficient algorithm for finding a local minimum of uLNML, i.e., finding a good penalty function in terms of LNML. This algorithm only adds an extra analytic step to the iteration of the original algorithm for the RERM problem, regardless of the dimensionality of the penalty design. Thus, issue (I2) is addressed. We combine these two methods and propose a novel method of penalty selection named MDL regularization selection (MDL-RS).
We also validate the proposed method from theoretical and empirical perspectives. Specifically, as our method relies on the approximation of uLNML and the CCCP algorithm on uLNML, the following questions arise.
(Q1) How well does uLNML approximate LNML?
(Q2) Does the CCCP algorithm on uLNML perform well with respect to generalization as compared to the other methods for penalty selection?
To answer Question (Q1), we show that the gap between uLNML and LNML is uniformly bounded under smoothness and convexity conditions. As for Question (Q2), from our experiments on example models involving both synthetic and benchmark datasets, we found that MDL-RS is at least comparable to the other methods and even outperforms them when models are highly redundant, as we expected. Therefore, the answer is affirmative.
The rest of the paper is organized as follows. In Section 2, we introduce a novel penalty selection criterion, uLNML, with uniform gap guarantees. Section 3 demonstrates some examples of the calculation of uLNML. Section 4 provides the minimization algorithm of uLNML and discusses its convergence property. Conventional methods for penalty selection are reviewed in Section 5. Experimental results are shown in Section 6. Finally, Section 7 concludes the paper and discusses future work.
2 Method: Analytic Upper Bound of LNMLs
In this section, we first briefly review the definition of RERM and the notion of penalty selection. Then, we introduce the LNML code length. Finally, as our main result, we show an upper bound of LNML, uLNML, and the associated minimization algorithm. Theoretical properties and examples of uLNML are presented in the last part.
2.1 Preliminary: Regularized Empirical Risk Minimization (RERM)
Let be an extended-value loss function of parameter with respect to data . We assume is a log-loss (but not limited to i.i.d. loss), i.e., it is normalized with respect to some base measure over , where for all in some closed subset . Here, can be a pair of a datum and label in the case of supervised learning. We drop the subscript and just write if there is no confusion. The regularized empirical risk minimization (RERM) with domain is defined as the minimization of the sum of the loss function and a penalty function over ,
where is the only hyperparameter that parametrizes the shape of penalty on . Let be the minimum value of the RERM. We assume that the minimizer always exists, and denote one of them as . Here, we focus on a special case of RERM in which the penalty is linear in ,
is a convex set of positive vectors. Let be the infimum of . We also assume that the following regularity condition holds:
Assumption 2.1 (Regular penalty functions)
If , then, for all , .
Regularization is beneficial from two perspectives. It improves the condition number of the optimization problem, and hence enhances the numerical stability of the estimates. It also prevents the estimate from overfitting to the training data, and hence reduces the generalization error.
However, these benefits come only with an appropriate penalization. If the penalty is too large, the estimate will be biased. If the penalty is too small, the regularization no longer takes effect and the estimate is likely to overfit. Therefore, we are motivated to select a good hyperparameter as a function of the data.
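As a concrete instance of RERM with a penalty that is linear in the hyperparameter, the sketch below uses a squared loss and one quadratic penalty term per coordinate, for which the minimizer has a closed form. The model, the synthetic data, and the helper name `rerm_ridge` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def rerm_ridge(A, y, lam):
    """Minimize 0.5*||y - A w||^2 + sum_j lam[j] * 0.5 * w[j]**2 over w.

    The penalty is linear in the vector lam (one weight per coordinate).
    Setting the gradient to zero gives (A^T A + diag(lam)) w = A^T y.
    """
    lam = np.asarray(lam, dtype=float)
    H = A.T @ A + np.diag(lam)
    w = np.linalg.solve(H, A.T @ y)
    objective = 0.5 * np.sum((y - A @ w) ** 2) + 0.5 * np.sum(lam * w ** 2)
    return w, objective, np.linalg.cond(H)

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 50))            # more features than samples: A^T A is singular
y = A[:, 0] + 0.1 * rng.normal(size=20)

w_small, _, cond_small = rerm_ridge(A, y, np.full(50, 1e-6))
w_large, _, cond_large = rerm_ridge(A, y, np.full(50, 1e+3))
```

With the small weights the fit essentially interpolates the training data (the overfitting regime), while the large weights shrink the estimate toward zero (the biased regime) and greatly reduce the condition number of the linear system.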
2.2 Luckiness Normalized Maximum Likelihood (LNML)
In order to select an appropriate hyperparameter , we introduce the luckiness normalized maximum likelihood (LNML) code length as a criterion for the penalty selection. The LNML code length associated with is given by
where is the normalizing factor of LNML.
The normalizing factor can be seen as a penalization of the complexity of . It quantifies how much will overfit to random data. If the penalty is small such that the minimum in (1) always takes a low value for all , becomes large. Specifically, any constant shift on the RERM objective, which does not change the RERM estimator , does not change LNML since cancels it out. Note that LNML was originally derived as a generalization of Shtarkov’s minimax coding strategy Shtar’kov (1987), Grünwald (2007). Moreover, recent advances in the analysis of LNML show that it bounds the generalization error of the RERM estimate Grünwald and Mehta (2017). Thus, our primary goal is to minimize the LNML code length (3).
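To make the role of the normalizing factor concrete, the following sketch evaluates it numerically for a toy model that is not taken from the paper: a single real observation x with loss (x - θ)²/2 + log(2π)/2 (a unit-variance Gaussian log-loss with respect to the Lebesgue measure) and penalty λθ²/2. It assumes the standard LNML normalizer of Grünwald (2007), i.e., the integral over the data of the exponentiated negative minimum of the RERM objective. For this toy model the minimum equals λx²/(2(1 + λ)) + log(2π)/2, so the logarithm of the normalizer is log(1 + 1/λ)/2 in closed form. The function names are hypothetical.

```python
import numpy as np

def log_normalizer_numeric(lam, half_width=200.0, num=400001):
    """log of the integral of exp(-min_theta [loss + penalty]) over x, on a fine grid."""
    x = np.linspace(-half_width, half_width, num)
    # Minimum of the RERM objective over theta, attained at theta = x / (1 + lam).
    min_objective = lam * x**2 / (2.0 * (1.0 + lam)) + 0.5 * np.log(2.0 * np.pi)
    dx = x[1] - x[0]
    return float(np.log(np.sum(np.exp(-min_objective)) * dx))

def log_normalizer_closed_form(lam):
    """Closed form for the toy model: 0.5 * log(1 + 1/lam)."""
    return 0.5 * np.log(1.0 + 1.0 / lam)

lams = (0.01, 0.1, 1.0, 10.0)
numeric = [log_normalizer_numeric(lam) for lam in lams]
closed = [float(log_normalizer_closed_form(lam)) for lam in lams]
for lam, a, b in zip(lams, numeric, closed):
    print(lam, a, b)
```

In this toy model the value grows without bound as λ approaches zero and vanishes as λ grows, which matches the interpretation of the normalizing factor as a complexity penalty on weakly regularized models.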
2.3 Upper Bound of LNML (uLNML)
The direct computation of the normalizing factor requires integration of the RERM objective (1) over all possible data, and hence, direct minimization is often intractable. To avoid computational difficulty, we introduce an upper bound of that is analytic with respect to . Then, adding the upper bound to the RERM objective, we have an upper bound of the LNML code length itself.
To derive the bound, let us define the -upper smoothness condition of the loss function .
Definition 1 (-upper smoothness)
A function is -upper smooth, or -upper smooth to avoid any ambiguity, over for some , if there exists a constant , vector-valued function , and monotone increasing function such that
where and .
Note that the -upper smoothness is a condition that is weaker than that of standard smoothness. In particular, -smoothness implies -upper smoothness. Moreover, it is noteworthy that all the bounded functions are upper smooth with respective .
Now, we show the main theorem that bounds . The theorem states that the upper bound depends on and only through their smoothness.
Theorem 2.2 (Upper bound of )
Suppose that is -upper smooth with respect to for all , and that is -upper smooth for . Then, for every symmetric neighborhood of the origin where , we have
where , and .
Proof. Let . First, by Hölder’s inequality, we have
Then, we will bound and in the right-hand side, respectively. Since we assume that is a logarithmic loss if , the second factor is simply evaluated using Fubini’s theorem,
On the other hand, by -upper smoothness of , we have
This concludes the proof.
The upper bound in Theorem 2.2 can be easily computed by ignoring the constant factor given the upper smoothness of and . In particular, the integral can be evaluated in a closed form if one chooses a suitable class of penalty functions with a suitable neighborhood ; e.g., linear combination of quadratic functions with . Therefore, we adopt this upper bound (up to constant terms) as an alternative to the LNML code length, namely uLNML,
where the symmetric set is fixed beforehand. In practice, we recommend just taking because uLNML with bounds uLNMLs with . However, for the sake of the later analysis, we leave to be arbitrary.
We present two useful specializations of uLNML with respect to the penalty function . One is the Tikhonov regularization, known as the $\ell_2$-regularization.
Corollary 1 (Bound for Tikhonov regularization)
Suppose that is -upper smooth for all and where for all . Then, we have
Proof. The claim follows from setting in Theorem 2.2 and the fact that is -upper smooth.
The other one is that of lasso Tibshirani (1996), known as $\ell_1$-regularization. It is useful if one needs sparse estimates .
Corollary 2 (Bound for lasso)
Suppose that is -upper smooth for all and that , where for all . Then, we have
Finally, we present a useful extension for RERMs with Tikhonov regularization, which contains the inverse temperature parameter as a part of the parameter:
where is the normalizing constant of the loss function. Here, we assume that is independent of the non-temperature parameter . Interestingly, the normalizing factor of uLNML for a variable temperature model (7), (8) is bounded with the same bound as that for the fixed temperature models in Corollary 1 except for a constant.
Corollary 3 (Bound for variable temperature model)
Let and . Note that is a continuous function, and hence bounded over , which implies that it is upper smooth. Let be the upper smoothness of over . Then,
2.4 Gap between LNML and uLNML
In this section, we evaluate the tightness of uLNML. To this end, we now bound LNML from below. The lower bound is characterized by the strong convexity of and .
Definition 2 (-strong convexity)
A function is -strongly convex if there exists a constant and a vector-valued function such that
Note that -strong convexity can be seen as the matrix-valued version of the standard strong convexity. Now, we have the following lower bound of .
Theorem 2.3 (Lower bound of )
Suppose that is -strongly convex and is -strongly convex, where and for all . Then, for every set of parameters , we have
where and .
Proof. Let . Let . First, from the positivity of , we have
Then, we bound from below and in the right-hand side, respectively. Since we assumed that is a logarithmic loss, the second factor is simply evaluated using Fubini’s theorem,
where the first inequality follows from Assumption 2.1. On the other hand, by the -strong convexity of , we have
for all . Here, we exploit the fact that we can take , if . This concludes the proof.
Theorem 2.4 (Uniform gap bound of uLNML)
The theorem implies that uLNML is a tight upper bound of the LNML code length if is strongly convex. Moreover, the gap bound (10) can be utilized for choosing a good neighborhood . Suppose that there is no effective boundary in the parameter space, . Then, we can simplify the gap bound and the optimal neighborhood is explicitly given.
Corollary 4 (Uniform gap bound for no-boundary case)
Suppose that the assumptions made in Theorem 2.4 are satisfied. Then, if , we have a uniform gap bound
for all and all . This bound is minimized with maximum , i.e., .
As a remark, if we assume in addition that is a smooth i.i.d. loss, i.e., and , the gap bound is also uniformly bounded with respect to the sample size . This is derived from the fact that the right-hand side of turns out to be
which is constant independent of .
In the previous sections, we derived an upper bound of the normalizing constant and defined an easy-to-compute alternative for the LNML code length, called uLNML. We also stated uniform gap bounds of uLNML for smooth penalty functions. Note that uLNML characterizes with upper smoothness of the loss and penalty functions. This is both advantageous and disadvantageous. The upper smoothness can often be easily computed even for complex models like deep neural networks. This makes uLNML applicable to a wide range of loss functions. On the other hand, if the Hessian of the loss function varies drastically across the parameter space, the gap can be considerably large. In this case, one can tighten the gap by reparametrizing to make the Hessian as uniform as possible.
The derivation of uLNML relies on the upper smoothness of the loss and penalty functions. In particular, our current analysis on the uniform gap guarantee given by Theorem 2.4 holds if the penalty function is smooth, i.e., . This is violated if one employs the $\ell_1$-penalties.
It should be noted that there exists an approximation of LNML originally given by Rissanen (1996) for a special case and then generalized by Grünwald (2007). This approximates LNML except for the term with respect to ,
where denotes the Fisher information matrix. A notable difference between this approximation and uLNML is in the boundedness of their approximation errors. The above term is not necessarily uniformly bounded with respect to , and actually it diverges for every fixed as in the case of, for example, the Tikhonov regularization. This is in contrast to uLNML in that the approximation gap of uLNML is uniformly bounded with respect to according to Theorem 2.4, and it does not necessarily go to zero as . This difference can be significant, especially in the scenario of penalty selection, where one compares different while is fixed.
3 Examples of uLNML
In the previous section, we have shown that the normalizing factor of LNML is bounded if the upper smoothness of is bounded. The upper smoothness can be easily characterized for a wide range of loss functions. Since we cannot cover all of them here, we present below a few examples that will be used in the experiments.
3.1 Linear Regression
Let be a fixed design matrix and represent the corresponding target variables. Then, we want to find such that . We assume that the ‘useful’ predictors may be sparse, and hence, most of the coefficients of the best for generalization may be close to zero. As such, we are motivated to solve the ridge regression problem:
where . According to Corollary 3, the uLNML of the ridge regression is given by
where . Note that the above uLNML is uniformly bounded because the normalizing constant of the LNML code length of (13) is bounded from below with a fixed variance that exactly evaluates to .
3.2 Conditional Dependence Estimation
Let be a sequence of observations independently drawn from the -dimensional Gaussian distribution. We assume that the conditional dependence among the variables in is scarce, which means that most of the coefficients of precision are (close to) zero. Thus, to estimate the precision matrix , we penalize the nonzero coefficients and consider the following RERM
denotes the probability density function of the Gaussian distribution. As it is an instance of the Tikhonov regularization, from Corollary 1 with , the uLNML for the graphical model is given by
4 Minimization of uLNML
Given data , we want to minimize uLNML (5) with respect to as it bounds the LNML code length, which is a measure of the goodness of the penalty with respect to the MDL principle Rissanen (1978), Grünwald (2007). Furthermore, it bounds the risk of the RERM estimate Grünwald and Mehta (2017). The problem is that grid-search-like algorithms are inefficient since the dimensionality of the domain is high.
In order to solve this problem, we derive a concave-convex procedure (CCCP) for uLNML minimization. The algorithm is justified with the convergence properties that result from the CCCP framework. Then, we also give concrete examples of the computation needed in the CCCP for typical RERMs.
4.1 Concave-convex Procedure (CCCP) for uLNML Minimization
In the forthcoming discussion, we assume that is closed, bounded, and convex for computational convenience. We also assume that the upper bound of the normalizing factor is convex with respect to . This is not a restrictive assumption as the true normalizing term is always convex if the penalty is linear as given in (2). In particular, it is actually convex for the Tikhonov regularization and lasso as in Corollary 1 and Corollary 2, respectively.
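The structure of such a procedure can be illustrated on a toy problem. The sketch below is a generic CCCP in the sense of Yuille and Rangarajan (2003) applied to a one-dimensional example; it is not the algorithm derived in this paper. It assumes a single observation x with loss (x - θ)²/2, penalty λθ²/2, and the convex complexity term log(1 + 1/λ)/2, which is the exact log-normalizer of this toy model and stands in here for the bound used by uLNML. The minimum of the RERM objective is concave in λ, being a pointwise minimum of functions that are linear in λ, so each iteration replaces it by its tangent at the current λ (whose slope is the penalty value θ̂²/2 at the current RERM estimate) and then minimizes the resulting convex surrogate in closed form. All function names and the bounds `lam_min`, `lam_max` are hypothetical.

```python
import numpy as np

def rerm_estimate(x, lam):
    """argmin_theta 0.5*(x - theta)**2 + 0.5*lam*theta**2."""
    return x / (1.0 + lam)

def objective(x, lam):
    """Concave part (minimum RERM value) plus convex part (toy log-normalizer)."""
    return lam * x**2 / (2.0 * (1.0 + lam)) + 0.5 * np.log(1.0 + 1.0 / lam)

def cccp(x, lam0=1.0, lam_min=1e-6, lam_max=1e6, iters=50):
    lam, history = lam0, []
    for _ in range(iters):
        theta = rerm_estimate(x, lam)      # usual RERM step at the current lam
        g = 0.5 * theta**2                 # slope of the tangent to the concave part
        # Extra analytic step: minimize lam*g + 0.5*log(1 + 1/lam) over lam > 0,
        # i.e. take the positive root of lam**2 + lam - 1/(2*g) = 0.
        lam = 0.5 * (-1.0 + np.sqrt(1.0 + 2.0 / g)) if g > 0 else lam_max
        lam = float(np.clip(lam, lam_min, lam_max))   # keep lam in a bounded interval
        history.append(float(objective(x, lam)))
    return lam, history

lam_hat, history = cccp(x=3.0)
```

For x = 3 the iterates approach λ = 1/(x² - 1) = 1/8, the stationary point of the toy objective, and the objective value is non-increasing along the iterations, which is the descent property that the CCCP framework provides. Each iteration costs one RERM solve plus one closed-form update, independent of how the candidate penalties would otherwise be enumerated.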
Recall that the objective function, uLNML, is written as
Therefore, the goal is to find that attains
as well as the associated RERM estimate . Note that the existence of follows from the continuity of the objective function