Accelerating Cross-Validation in Multinomial Logistic Regression with ℓ_1-Regularization

We develop an approximate formula for evaluating a cross-validation estimator of predictive likelihood for multinomial logistic regression regularized by an ℓ_1-norm. This allows us to avoid repeated optimizations required for literally conducting cross-validation; hence, the computational time can be significantly reduced. The formula is derived through a perturbative approach employing the largeness of the data size and the model dimensionality. Its usefulness is demonstrated on simulated data and the ISOLET dataset from the UCI machine learning repository.


1 Introduction

Multinomial classification is a ubiquitous task. There are several ways to treat this task, such as naive Bayes methods, neural networks, decision trees, and hierarchical classification schemes (Hastie et al., 2009). Among them, in this paper we focus on multinomial logistic regression (MLR), which is simple but powerful enough to be used in many present-day applications.

Let us denote each feature vector by $x_\mu \in \mathbb{R}^N$ and its class by $y_\mu \in \{1,\dots,L\}$, where $\mu = 1,\dots,M$ denotes the index of given data. The MLR uses a linear structural model with parameters $\{w_a\}_{a=1}^L$ and computes a class-$a$ bias as an overlap:

$$u_{\mu a} = x_\mu^\top w_a. \qquad (1)$$

The probability that the feature vector $x_\mu$ belongs to the class $a$ is computed through a softmax function as:

$$\phi\left(a \,\middle|\, \{u_{\mu b}\}_{b=1}^L\right) = \frac{e^{u_{\mu a}}}{\sum_{b=1}^L e^{u_{\mu b}}}. \qquad (2)$$

These define the MLR.
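For concreteness, eqs. (1) and (2) can be sketched in a few lines of numpy; the function and variable names here are ours, for illustration only:

```python
import numpy as np

def mlr_probs(X, W):
    """Class probabilities of the MLR.

    X : (M, N) matrix of feature vectors x_mu, W : (N, L) matrix of weights w_a.
    Returns the (M, L) matrix of softmax probabilities, eq. (2).
    """
    U = X @ W                          # overlaps u_{mu a} = x_mu^T w_a, eq. (1)
    U -= U.max(axis=1, keepdims=True)  # stabilize the exponentials
    E = np.exp(U)
    return E / E.sum(axis=1, keepdims=True)
```

Note that shifting every $w_a$ by a common vector leaves the probabilities unchanged, which is exactly the gauge symmetry discussed in sec. 2.1.1.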

The maximum likelihood estimation is usually employed to train the MLR, though the learning result tends to be inefficient when the data size is not sufficiently larger than the model dimensionality or when noise of a relevant level is present. A common technique to overcome this difficulty is to introduce a penalty or regularization. In this paper, we use the ℓ_1-regularization, which induces a sparse classifier as a learning result and is widely accepted to be effective. Given $M$ data points $D_M = \{(x_\mu, y_\mu)\}_{\mu=1}^M$, the ℓ_1-regularized estimator is defined by the following optimization problem:

$$\{\hat w_a(\lambda)\}_{a=1}^L = \operatorname*{argmin}_{\{w_a\}_a} H\left(\{w_a\}_{a=1}^L \,\middle|\, D_M, \lambda\right), \qquad (3)$$

$$H\left(\{w_a\}_{a=1}^L \,\middle|\, D_M, \lambda\right) \equiv \sum_{\mu=1}^M q_\mu\left(\{w_a\}_{a=1}^L\right) + \lambda \sum_{a=1}^L \|w_a\|_1, \qquad (4)$$

$$q_\mu\left(\{w_a\}_{a=1}^L\right) = -\ln \phi\left(y_\mu \,\middle|\, \{u_{\mu a} = x_\mu^\top w_a\}_{a=1}^L\right), \qquad (5)$$

where we denote the negative log-likelihood of the $\mu$th data point as $q_\mu$ and define a regularized cost function or Hamiltonian $H$.

The introduction of regularization causes another problem of model selection, or hyper-parameter estimation with respect to $\lambda$. A versatile framework providing a reasonable estimate is cross-validation (CV), but it has a disadvantage in terms of the computational cost. The literal CV requires repeated optimizations, which can be a serious computational burden when the data size and the model dimensionality are large. The purpose of this paper is to resolve this problem by inventing an efficient approximation of CV.

Our technique is based on a perturbative expansion employing the largeness of the data size and the model dimensionality. Similar techniques were also developed for the Bayesian learning of the simple perceptron and the committee machine (Opper and Winther, 1996, 1997), for Gaussian processes and support vector machines (Opper and Winther, 2000a, b; Vapnik and Chapelle, 2000), and for linear regression with the ℓ_1-regularization (Obuchi and Kabashima, 2016; Rad and Maleki, 2018; Wang et al., 2018) and with the two-dimensional total variation (Obuchi et al., 2017). Actually, this perturbative approach is fairly general and can be applied to a wide class of generalized linear models with simple convex regularizations. For example, in the present MLR case it is easy to extend our result to the case where both the ℓ_1- and ℓ_2-regularizations are present (the elastic net, Zou and Hastie, 2005), which is used in a common implementation (Friedman et al., 2010). The derivation of our approximate formula below is, however, conducted for the ℓ_1-regularization only, for simplicity. The extension to the elastic net case is stated after the derivation.

The rest of the paper is organized as follows. In sec. 2, we state our formulation and how to derive the approximate formula. In sec. 3, we compare our approximation result with that of the literally conducted CV on simulated data and on the ISOLET dataset from the UCI machine learning repository (Lichman, 2013). The accuracy and the computational time of our approximate formula are reported in comparison with the literal CV. The limitation of a simplified version of the approximation is also examined. The last section is devoted to the conclusion.

2 Formulation

In the maximum likelihood estimation framework, it is natural to employ a predictive likelihood as a criterion for model selection (Bjornstad, 1990; Ando and Tsay, 2010). We require a good estimator of the predictive likelihood, and the CV provides a simple realization of it. Particularly in this paper, we consider an estimator based on the leave-one-out (LOO) CV. The LOO solution is described by

$$\{\hat w^{\backslash\mu}_a(\lambda)\}_a = \operatorname*{argmin}_{\{w_a\}_a}\left\{ H^{\backslash\mu}\left(\{w_a\}_{a=1}^L \,\middle|\, D_M, \lambda\right)\right\}, \qquad (6)$$

$$H^{\backslash\mu}\left(\{w_a\}_{a=1}^L \,\middle|\, D_M, \lambda\right) \equiv \sum_{\nu(\neq\mu)} q_\nu\left(\{w_a\}_{a=1}^L\right) + \lambda\sum_{a=1}^L \|w_a\|_1. \qquad (7)$$

Denoting the overlap of $x_\mu$ with the LOO solution as $\hat u^{\backslash\mu}_{\mu a} = x_\mu^\top \hat w^{\backslash\mu}_a$, as well as that with the full solution as $\hat u_{\mu a} = x_\mu^\top \hat w_a$, we can define the LOO estimator (LOOE) of the predictive negative log-likelihood as:

$$\epsilon_{\rm LOO}(\lambda) = \frac{1}{M}\sum_{\mu=1}^M q_\mu\left(\{\hat w^{\backslash\mu}_a\}_{a=1}^L\right) = -\frac{1}{M}\sum_{\mu=1}^M \ln\phi\left(y_\mu \,\middle|\, \{\hat u^{\backslash\mu}_{\mu a}\}_{a=1}^L\right). \qquad (8)$$

In the following, the predictive negative log-likelihood is simply called the prediction error. The minimum of the LOOE determines the optimal value of $\lambda$, though its evaluation requires us to solve eq. (6) $M$ times, which is computationally demanding.
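To make the cost of the literal LOO CV concrete, the following sketch refits the model $M$ times. A simple proximal-gradient (ISTA) loop stands in for the ℓ_1-regularized solver (the paper itself uses the coordinate-descent solver Glmnet), and all names are ours:

```python
import numpy as np

def fit_l1_mlr(X, y, lam, L, steps=1000):
    """Tiny proximal-gradient (ISTA) solver for eq. (4); a stand-in sketch."""
    M, N = X.shape
    W = np.zeros((N, L))
    Y = np.eye(L)[y]                                   # one-hot labels
    lr = 1.0 / (np.linalg.norm(X, ord=2) ** 2 + 1e-9)  # safe step size
    for _ in range(steps):
        U = X @ W
        U -= U.max(axis=1, keepdims=True)
        P = np.exp(U)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * (X.T @ (P - Y))                      # gradient of sum_mu q_mu
        W = np.sign(W) * np.maximum(np.abs(W) - lr * lam, 0.0)  # l1 prox
    return W

def literal_loo(X, y, lam, L):
    """Literal LOO estimator of eq. (8): one full refit per left-out point."""
    M = X.shape[0]
    nll = 0.0
    for mu in range(M):                                # M separate optimizations
        keep = np.arange(M) != mu
        W = fit_l1_mlr(X[keep], y[keep], lam, L)
        u = X[mu] @ W
        u -= u.max()
        p = np.exp(u) / np.exp(u).sum()
        nll -= np.log(p[y[mu]])
    return nll / M
```

The inner refit is exactly what the approximate formula derived below avoids.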

2.0.1 Notations

Here, we fix the notations for a better flow of the derivation shown below. By summarizing the class index, we introduce a vector notation of the overlaps, $u_\mu = (u_{\mu 1},\dots,u_{\mu L})^\top$, and an extended vector representation of the weight vectors, $W = (w_1^\top,\dots,w_L^\top)^\top \in \mathbb{R}^{NL}$. The $m$th component of $W$ can thus be decomposed into two parts as $m = (c, f)$, where $c$ denotes the class index and $f$ represents the component index of the feature vector; namely, we write $W_m = w_{cf}$. Correspondingly, we leverage a matrix $X^\mu \in \mathbb{R}^{L\times NL}$ to define a repetition representation of the feature vector $x_\mu$: each component is defined as:

$$X^\mu_{am} \equiv \delta_{a c}\, x_{\mu f}, \qquad m = (c, f). \qquad (9)$$

This yields simple and convenient relations:

$$u_\mu = X^\mu W, \qquad X^\mu = \left(\frac{\partial u_\mu}{\partial W}\right)^{\!\top}. \qquad (10)$$

Further, the class-$a$ probability of the $\mu$th data point at the full solution is denoted by:

$$p_{a|\mu} = \phi\left(a \,\middle|\, \{\hat u_{\mu b}\}_b\right) = \frac{e^{\hat u_{\mu a}}}{\sum_{b=1}^L e^{\hat u_{\mu b}}}. \qquad (11)$$

These notations express the gradient and the Hessian of $q_\mu$ at the full solution as:

$$\nabla q_\mu(\hat W) \equiv \left.\frac{\partial q_\mu}{\partial W}\right|_{W=\hat W} = \left.\frac{\partial u_\mu}{\partial W}\,\frac{\partial q_\mu}{\partial u_\mu}\right|_{u_\mu=\hat u_\mu} = (X^\mu)^\top b_\mu, \qquad (12)$$

$$\partial^2 q_\mu(\hat W) \equiv \left.\frac{\partial^2 q_\mu}{\partial W\, \partial W'}\right|_{W=W'=\hat W} = \frac{\partial u_\mu}{\partial W}\left(\left.\frac{\partial^2 q_\mu}{\partial u_\mu\, \partial u'_\mu}\right|_{u_\mu=u'_\mu=\hat u_\mu}\right)\left(\frac{\partial u'_\mu}{\partial W'}\right)^{\!\top} = (X^\mu)^\top F^\mu X^\mu, \qquad (13)$$

where

$$b_\mu \equiv \left(p_{1|\mu} - \delta_{1 y_\mu},\; p_{2|\mu} - \delta_{2 y_\mu},\; \cdots,\; p_{L|\mu} - \delta_{L y_\mu}\right)^\top, \qquad (14)$$

$$F^\mu_{ab} \equiv \delta_{ab}\, p_{a|\mu} - p_{a|\mu}\, p_{b|\mu}. \qquad (15)$$
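A minimal numpy sketch of eqs. (14) and (15), with names of our choosing:

```python
import numpy as np

def grad_hess_factors(p, y, L):
    """b_mu (eq. 14) and F^mu (eq. 15) from the class probabilities p_{a|mu}."""
    b = p - np.eye(L)[y]              # p_{a|mu} - delta_{a, y_mu}
    F = np.diag(p) - np.outer(p, p)   # diag(p) - p p^T
    return b, F
```

Each row of $F^\mu$ sums to zero, which is the per-sample counterpart of the Hessian zero modes discussed in sec. 2.1.1.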

In addition, we denote the cost function Hessians at the respective solutions as:

$$G \equiv \partial^2 H(\hat W) = \sum_\mu \partial^2 q_\mu(\hat W), \qquad (16)$$

$$G^{\backslash\mu} \equiv \partial^2 H^{\backslash\mu}(\hat W^{\backslash\mu}) = \sum_{\nu(\neq\mu)} \partial^2 q_\nu(\hat W^{\backslash\mu}). \qquad (17)$$

Finally, we introduce the symbol $\hat A$ representing the index set of the active (nonzero) components of $\hat W$ and $\hat W^{\backslash\mu}$. Given $\hat A$, we denote the active components of a vector $v$ by the subscript as $v_{\hat A}$. A similar notation is used for any matrix, and the symbol $*$ is assumed to represent all of the components in the corresponding dimension.

2.1 Approximate formula

For a simple derivation, it is important to note that the $W$-dependence of $q_\mu$ appears only through the overlap $u_\mu$. Hence, it is sufficient to provide the relation between $\hat u^{\backslash\mu}_\mu$ and $\hat u_\mu$ in order to derive the approximate formula.

A crucial assumption in deriving the formula is that the active set is “common” between the full and LOO solutions $\hat W$ and $\hat W^{\backslash\mu}$; namely, $\hat A = \hat A^{\backslash\mu}$. Although this assumption is literally not true, we numerically confirmed that it approximately holds: the change of the active set under the LOO operation is small enough compared to the size of the active set itself when $M$ and $N$ are large. Moreover, in the related problem of ℓ_1-regularized linear regression, the so-called LASSO, it has been shown that the contribution of the active set change vanishes in the limit $M, N \to \infty$ with $\alpha = M/N$ kept finite (Obuchi and Kabashima, 2016). It is expected that the same holds in the present problem. Hence, we adopt this assumption in the following. Note that this idea of the constancy of the active set can be found in preceding analyses of the support vector machine (Opper and Winther, 2000b; Vapnik and Chapelle, 2000).

Once the active set is assumed to be known and unchanged by the LOO operation, it is easy to determine the active components of the full and LOO solutions, $\hat W_{\hat A}$ and $\hat W^{\backslash\mu}_{\hat A}$. The vanishing condition of the gradient of the cost function is the determining equation:

$$\left(\nabla H\right)_{\hat A} = 0 \;\Rightarrow\; \hat W_{\hat A}, \qquad (18)$$

$$\left(\nabla H^{\backslash\mu}\right)_{\hat A} = \left(\nabla H\right)_{\hat A} - \left(\nabla q_\mu\right)_{\hat A} = 0 \;\Rightarrow\; \hat W^{\backslash\mu}_{\hat A}. \qquad (19)$$

The difference between the gradients is only $(\nabla q_\mu)_{\hat A}$, and hence the difference between $\hat W_{\hat A}$ and $\hat W^{\backslash\mu}_{\hat A}$ is expected to be small. Denoting the difference as $d^\mu_{\hat A} \equiv \hat W^{\backslash\mu}_{\hat A} - \hat W_{\hat A}$ and expanding eq. (19) with respect to $d^\mu_{\hat A}$ up to the first order, we obtain an equation determining $d^\mu_{\hat A}$:

$$d^\mu_{\hat A} = -\left(G^{\backslash\mu}_{\hat A\hat A}\right)^{-1}\left(\nabla q_\mu(\hat W)\right)_{\hat A}. \qquad (20)$$

Inserting this and eq. (12) into the definition of $\hat u^{\backslash\mu}_\mu$ and multiplying by $X^\mu_{*\hat A}$ from the left, we obtain:

$$\hat u^{\backslash\mu}_\mu \approx \hat u_\mu + C^{\backslash\mu}_\mu b_\mu, \qquad (21)$$

$$C^{\backslash\mu}_\mu \equiv X^\mu_{*\hat A}\left(G^{\backslash\mu}_{\hat A\hat A}\right)^{-1}\left(X^\mu_{*\hat A}\right)^{\!\top}. \qquad (22)$$

This equation implies that a matrix inversion is necessary for each $\mu$, which still requires a significant computational cost. To avoid this, we employ the approximation $\partial^2 H^{\backslash\mu}(\hat W^{\backslash\mu}) \approx \partial^2 H^{\backslash\mu}(\hat W)$ and the Woodbury matrix inversion formula, in conjunction with eqs. (13), (16), and (17). The result is:

$$\left(G^{\backslash\mu}\right)^{-1} \equiv \left(\partial^2 H^{\backslash\mu}(\hat W^{\backslash\mu})\right)^{-1} \approx \left(\partial^2 H^{\backslash\mu}(\hat W)\right)^{-1} = \left(G - (X^\mu)^\top F^\mu X^\mu\right)^{-1} = G^{-1} - G^{-1}(X^\mu)^\top\left(-\left(F^\mu\right)^{-1} + X^\mu G^{-1}(X^\mu)^\top\right)^{-1} X^\mu G^{-1}. \qquad (23)$$

Inserting this into eq. (21) and simplifying several factors, we obtain:

$$\hat u^{\backslash\mu}_\mu \approx \hat u_\mu + C^\mu\left(I_L - F^\mu C^\mu\right)^{-1} b_\mu, \qquad (24)$$

where

$$C^\mu = X^\mu_{*\hat A}\left(G_{\hat A\hat A}\right)^{-1}\left(X^\mu_{*\hat A}\right)^{\!\top}. \qquad (25)$$

Now, all of the variables on the right-hand side of eq. (24) can be computed from the full solution only, which enables us to estimate the LOOE from a one-time optimization using all of the data $D_M$, while avoiding repeated optimizations.
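The whole computation can be sketched as follows, assuming the full solution `W_hat` is already available. Here `np.linalg.pinv` stands in for the zero-mode-removed inverse of eq. (28), and all names are our own:

```python
import numpy as np

def acv_loo(X, y, W_hat, tol=1e-8):
    """Approximate LOOE via eqs. (24)-(25): a numpy sketch.

    X : (M, N) features, y : (M,) labels in {0,...,L-1},
    W_hat : (N, L) full l1-regularized solution (assumed given).
    """
    M, N = X.shape
    L = W_hat.shape[1]
    U = X @ W_hat                                     # full-solution overlaps
    E = np.exp(U - U.max(axis=1, keepdims=True))
    P = E / E.sum(axis=1, keepdims=True)              # p_{a|mu}, eq. (11)
    # active set over flattened coordinates m = (a, i) -> a * N + i
    A = np.flatnonzero(np.abs(W_hat.T.ravel()) > tol)
    # rows of the repetition matrix X^mu (eq. 9) restricted to the active set
    Xa = np.zeros((M, L, A.size))
    for j, m in enumerate(A):
        a, i = divmod(m, N)
        Xa[:, a, j] = X[:, i]
    G = np.zeros((A.size, A.size))                    # Hessian block, eq. (16)
    for mu in range(M):
        F = np.diag(P[mu]) - np.outer(P[mu], P[mu])   # eq. (15)
        G += Xa[mu].T @ F @ Xa[mu]
    G_inv = np.linalg.pinv(G)                         # zero modes removed
    eps = 0.0
    for mu in range(M):
        b = P[mu] - np.eye(L)[y[mu]]                  # eq. (14)
        F = np.diag(P[mu]) - np.outer(P[mu], P[mu])
        C = Xa[mu] @ G_inv @ Xa[mu].T                 # eq. (25)
        u = U[mu] + C @ np.linalg.solve(np.eye(L) - F @ C, b)  # eq. (24)
        u -= u.max()
        p = np.exp(u) / np.exp(u).sum()
        eps -= np.log(p[y[mu]])
    return eps / M
```

Everything here is computed from the single full solution; no refitting occurs.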

We should mention the computational cost of this approximation: it mainly scales as $O\!\left(ML|\hat A|^2 + ML^2|\hat A| + |\hat A|^3\right)$. The first two terms come from the construction of $G$ and $\{C^\mu\}_\mu$, and the last one is derived from the inverse of $G_{\hat A\hat A}$. If $|\hat A|$ is proportional to the feature dimensionality $N$, this computational cost is of the third order with respect to the system dimensionality $N$ and the data size $M$. This is admittedly not cheap, and the computational cost of the $k$-fold literal CV with a moderate value of $k$ becomes smaller than that of our approximation in a large dimensionality limit. We, however, stress that there actually exists a wide range of $M$ and $N$ values in which our approximation outperforms the literal CV in terms of the computational time, as later demonstrated in sec. 3. Moreover, for treating much larger systems, we invent a further simplified approximation based on the above approximate formula. The computational cost of this simplified version scales only linearly with respect to the system parameters $N$ and $M$. Its derivation is in sec. 2.2 and a precision comparison with the original approximation is in sec. 3.

Another sensitive issue is present in computing $C^\mu$: occasionally the cost function Hessian $G_{\hat A\hat A}$ has zero eigenvalues and is not invertible. We handle this problem in the next subsection.

2.1.1 Handling zero modes

In the MLR, there is an intrinsic symmetry such that the model is invariant under the addition of any constant vector to the weight vectors of all classes:

$$w_a \to w_a + v \quad (\forall a). \qquad (26)$$

In this sense, the weight vectors defining the same model are “degenerate” and our MLR is singular. For finite $\lambda$, this is not harmful because the regularization term resolves the singularity and selects, among the degenerate vectors, the optimal one with the smallest value of the ℓ_1 norm. However, this does not mean that the associated Hessian is non-singular: the regularization term does not provide any direct contribution to the Hessian and, as a result, the Hessian tends to have some zero modes. This prevents taking the inverse Hessian in eq. (25). How can we overcome this?

One possibility is to fix the weights of one certain class at constant values when solving the optimization problem (4). This is termed “gauge fixing” in physics, and one convenient gauge in the present problem will be the zero gauge in which the weights in a chosen class are fixed at zeros. This is actually found in some earlier implementations (Krishnapuram et al., 2005; Schmidt, 2010) and is preferable for our approximate formula because it removes the harmful zero modes of the Hessian from the beginning. However, some other implementations which are currently well accepted do not employ such gauge fixing (Friedman et al., 2010), and moreover even with gauge fixing very small eigenvalues sometimes accidentally emerge in the Hessian. Hence, for user convenience, we require another way of avoiding this problem.

Another possibility is to remove the zero modes by hand. By construction, the zero modes are associated with the model invariance. This implies that those zero modes are irrelevant and may be removed: we are only interested in perturbations which truly change the model, and the modes which leave the model unchanged are unnecessary. According to this consideration, we replace $\left(G_{\hat A\hat A}\right)^{-1}$ in eq. (25) with the zero-mode-removed inverse Hessian $\bar G^{-1}_{\hat A\hat A}$. Its computation is straightforward: we perform the eigenvalue decomposition of $G_{\hat A\hat A}$, obtaining the eigenvalues $\{d_i\}$ and eigenvectors $\{v_i\}$, which allows us to represent

$$G_{\hat A\hat A} = \sum_i d_i v_i v_i^\top = \sum_{i\in S_+} d_i v_i v_i^\top, \qquad (27)$$

where $S_+$ denotes the index set of the modes with finite eigenvalues. Then, $\bar G^{-1}_{\hat A\hat A}$ is defined as:

$$\bar G^{-1}_{\hat A\hat A} \equiv \sum_{i\in S_+} d_i^{-1} v_i v_i^\top. \qquad (28)$$

Finally, we replace $\left(G_{\hat A\hat A}\right)^{-1}$ by $\bar G^{-1}_{\hat A\hat A}$ in eq. (25) and obtain:

$$C^\mu = X^\mu_{*\hat A}\, \bar G^{-1}_{\hat A\hat A}\left(X^\mu_{*\hat A}\right)^{\!\top}. \qquad (29)$$

By using this instead of eq. (25), the problem caused by the zero modes can be avoided.
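Eqs. (27) and (28) amount to a truncated eigendecomposition; a sketch, with the cutoff value chosen by us for illustration:

```python
import numpy as np

def zero_mode_removed_inverse(G, cutoff=1e-10):
    """Invert only the modes in S_+ (eigenvalues above a small cutoff)."""
    d, V = np.linalg.eigh(G)        # G is symmetric
    keep = d > cutoff               # the index set S_+
    # sum over S_+ of d_i^{-1} v_i v_i^T
    return (V[:, keep] / d[keep]) @ V[:, keep].T
```

For a symmetric positive semidefinite matrix this coincides with the Moore-Penrose pseudo-inverse.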

2.1.2 Extension to the mixed regularization case

Let us briefly state how the present result generalizes to the case of the mixed ℓ_1- and ℓ_2-regularizations (the elastic net, Zou and Hastie, 2005). The problem to be solved is defined as follows:

$$\{\hat w_a(\lambda_1, \lambda_2)\}_a = \operatorname*{argmin}_{\{w_a\}_a}\left\{\sum_{\mu=1}^M q_\mu\left(\{w_a\}_{a=1}^L\right) + \lambda_1\sum_{a=1}^L \|w_a\|_1 + \frac{\lambda_2}{2}\sum_{a=1}^L \|w_a\|_2^2\right\}, \qquad (30)$$

where $\|\cdot\|_2$ denotes the ℓ_2 norm. Following the derivation in sec. 2.1, we realize that the derivation is essentially the same; the difference only appears in the cost function Hessian:

$$G_{\rm mxd} = \sum_\mu \partial^2 q_\mu(\hat W) + \lambda_2 I_{NL}, \qquad (31)$$

where $I_{NL}$ is the identity matrix of size $NL$. As a result, we can compute the LOO solution by leveraging the same equation as eq. (24), replacing the definition of $C^\mu$, eq. (25), with:

$$C^\mu = X^\mu_{*\hat A}\left(\left(G_{\rm mxd}\right)_{\hat A\hat A}\right)^{-1}\left(X^\mu_{*\hat A}\right)^{\!\top}. \qquad (32)$$

Thanks to the ℓ_2 term, the zero-mode removal is not needed, since all the eigenvalues are lifted up by $\lambda_2$.

2.1.3 Binomial case

The binomial case is particularly interesting in several applications and thus we write down the specific formula for this case.

In the binomial case, it is fairly common to express the class as a binary variable $y_\mu \in \{0, 1\}$ and to use the following logit function:

$$\phi_{\rm logit}(y_\mu \,|\, u_\mu) = \frac{\delta_{y_\mu 1} + \delta_{y_\mu 0}\, e^{-u_\mu}}{1 + e^{-u_\mu}}, \qquad (33)$$

where

$$u_\mu = x_\mu^\top w. \qquad (34)$$

If we identify $u_\mu$ in this case with the overlap difference $u_{\mu 1} - u_{\mu 2}$ of the two-class MLR, this is nothing but the two-class MLR in the zero gauge $w_2 = 0$. Hence, there is no harmful zero mode in the Hessian and we can straightforwardly apply our approximate formula. The explicit form in this case is:

$$\hat u^{\backslash\mu}_\mu \approx \hat u_\mu + \frac{c^\mu}{1 - \dfrac{\partial^2 q_\mu}{\partial u_\mu^2}\, c^\mu}\,\frac{\partial q_\mu}{\partial u_\mu}, \qquad (35)$$

where the derivatives are evaluated at the full solution $\hat u_\mu = x_\mu^\top \hat w$, and

$$\frac{\partial q_\mu}{\partial u_\mu} = \delta_{y_\mu 0} - \frac{e^{-u_\mu}}{1+e^{-u_\mu}}, \qquad (36)$$

$$\frac{\partial^2 q_\mu}{\partial u_\mu^2} = \frac{e^{-u_\mu}}{\left(1+e^{-u_\mu}\right)^2}, \qquad (37)$$

$$G_{\hat A\hat A} = \sum_{\mu=1}^M \frac{\partial^2 q_\mu}{\partial u_\mu^2}\left(x_\mu x_\mu^\top\right)_{\hat A\hat A}, \qquad (38)$$

$$c^\mu = \left(x_\mu^\top\right)_{\hat A}\left(G_{\hat A\hat A}\right)^{-1}\left(x_\mu\right)_{\hat A}, \qquad (39)$$

and $\hat A$ is the active set of the full solution, as before.
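The binomial formula fits in a short numpy sketch, assuming the full solution `w_hat` is given; names are our own:

```python
import numpy as np

def binomial_acv_loo(X, y, w_hat, tol=1e-8):
    """Approximate LOOE for the logit model, eqs. (35)-(39).

    X : (M, N) features, y : (M,) labels in {0, 1},
    w_hat : (N,) full l1-regularized solution (assumed given).
    """
    u = X @ w_hat                               # eq. (34)
    s = 1.0 / (1.0 + np.exp(-u))                # note e^{-u}/(1+e^{-u}) = 1 - s
    dq = (y == 0).astype(float) - (1.0 - s)     # eq. (36)
    d2q = s * (1.0 - s)                         # eq. (37)
    A = np.flatnonzero(np.abs(w_hat) > tol)     # active set of the full solution
    Xa = X[:, A]
    G = (Xa * d2q[:, None]).T @ Xa              # eq. (38), restricted to A
    G_inv = np.linalg.pinv(G)
    c = np.einsum('mi,ij,mj->m', Xa, G_inv, Xa)     # eq. (39)
    u_loo = u + c / (1.0 - d2q * c) * dq            # eq. (35)
    s_loo = 1.0 / (1.0 + np.exp(-u_loo))
    p = np.where(y == 1, s_loo, 1.0 - s_loo)        # eq. (33)
    return -np.mean(np.log(p))                      # LOOE, eq. (8)
```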

Note that this approximation can easily be generalized to arbitrary differentiable output functions by replacing the logit function $\phi_{\rm logit}$. Readers are thus encouraged to implement approximate CVs for a variety of different problems.

2.2 Further simplified approximation

As mentioned above, the computational cost of our approximation grows as the third order of the system dimensionality and should be reduced for treating larger systems. To this end, we derive a further simplified approximation based on the approximate formula invented above. We call this the self-averaging (SA) approximation, following physics terminology.

The basic idea for simplifying our approximate formula is to assume that the correlations between different feature components are sufficiently weak. The meaning of “correlation” is not evident here, but as seen in sec. A the inverse Hessian can be connected to a (rescaled) covariance of the weights in a statistical mechanical formulation introducing a probability distribution of $W$. Our weak correlation assumption requires that the correlation between different feature components be negligibly small, where the class indices $a, b$ and the feature component indices $i, j$ are as defined thus far. In this way, the inverse Hessian is assumed to take a rather restricted form:

$$\left(G^{-1}\right)_{(ai)(bj)} \approx \delta_{ij}\,(\chi^i)_{ab}. \qquad (40)$$

Namely, the SA Hessian is allowed to take finite values if and only if the two indices share the same feature vector component. The dependence on the data index $\mu$ is also assumed to be negligible, implying that strong heterogeneity among the feature vectors is assumed to be absent.

To proceed with the computation, we require a closed equation determining the matrix $\chi^i$ for each $i$. Its derivation is rather technical and is deferred to sec. A. The result is:

$$\left(\chi^i\right)_{\hat A_i\hat A_i} = \left(\lambda_2 I_{|\hat A_i|} + \sigma_x^2 \sum_{\nu=1}^M \left(\left(I_L + F^\nu C^{\rm SA}\right)^{-1} F^\nu\right)_{\hat A_i\hat A_i}\right)^{-1}, \qquad (41)$$

where $\sigma_x^2$ denotes the variance of the feature vector components and $\hat A_i$ is the set of active class variables at the feature component $i$; the other components of $\chi^i$, related to inactive variables, are zeros. The SA approximation of $C^\mu$, denoted $C^{\rm SA}$, is defined by:

$$C^{\rm SA} = \sigma_x^2 \sum_{i=1}^N \chi^i. \qquad (42)$$

Using the solution of eqs. (41,42), the approximate formula is now simply expressed as:

$$\hat u^{\backslash\mu}_\mu \approx \hat u_\mu + C^{\rm SA} b_\mu. \qquad (43)$$

Note that there is no factor like $\left(I_L - F^\mu C^\mu\right)^{-1}$, in contrast to eq. (24), because we directly approximate $C^{\backslash\mu}_\mu$ in eq. (21).

When solving eqs. (41) and (42), the inverse on the right-hand side of eq. (41) occasionally becomes ill-defined, again due to the presence of zero modes. In such cases, we should remove the zero modes as in eq. (28). Writing $R_{\hat A_i\hat A_i}$ for the matrix inverted in eq. (41) and performing the eigenvalue decomposition, we define its zero-mode-removed inverse as:

$$R_{\hat A_i\hat A_i} = \sum_j d_j v_j v_j^\top = \sum_{j\in S_+} d_j v_j v_j^\top \;\Rightarrow\; \bar R^{-1}_{\hat A_i\hat A_i} = \sum_{j\in S_+} d_j^{-1} v_j v_j^\top, \qquad (44)$$

where $S_+$ is the index set of the modes with finite eigenvalues. This requires a computational cost of at most $O(L^3)$ per feature component. Leveraging this approach, a naive way to solve eqs. (41) and (42) is recursive substitution. If this converges in a constant number of iterations, irrespective of the system parameters $N$ and $M$, then the computational cost of the SA approximation scales only linearly in the feature dimensionality $N$ and the data size $M$; hence, its advantage is significant.

2.3 Summary of procedures

Here, we summarize the two versions of the approximation derived thus far as algorithmic procedures. We call the first version, based on eq. (24), the approximate CV or ACV, and call the second one, using eq. (43), the self-averaging approximate CV or SAACV. The procedures of ACV and SAACV are given in Alg. 1 and Alg. 2, respectively; they are written for the case of the mixed regularization (30).

Comments are added to specify the time-consuming parts of the entire procedures. In Alg. 2, we describe an actual implementation for solving eqs. (41) and (42) by recursion, which is not fully specified in sec. 2.2. The symbol $\|\cdot\|_{\rm F}$ denotes the Frobenius norm; the threshold judging the convergence of the recursion, and a second threshold judging whether the relevant quantity is large or not, are set to values appropriate for typical situations.

3 Numerical experiments

In this section, we examine the precision and actual computational time of ACV and SAACV in numerical experiments. Both simulated and actual datasets (from the UCI machine learning repository, Lichman, 2013) are used.

For examination, we also compute the errors by literally conducting $k$-fold CV for several values of $k$, and compare them to the result of our approximate formula. In principle, we should compare our approximate result with that of the LOO CV ($k = M$), because our formula approximates it. However, for large $M$ the literal LOO CV requires a huge computational burden, even though its result is empirically not much different from that of the $k$-fold CV with moderate $k$. Hence, in some of the following experiments with large $M$, we use the $k$-fold CV instead of the LOO CV. Further, to directly check the approximation accuracy, we also compute the normalized error difference defined as

$$\frac{\epsilon^{\rm approximate}_{\rm LOO} - \epsilon^{\rm literal}_{\rm CV}}{\epsilon^{\rm literal}_{\rm CV}}, \qquad (45)$$

where $\epsilon^{\rm literal}_{\rm CV}$ denotes the literal CV estimator of the prediction error while $\epsilon^{\rm approximate}_{\rm LOO}$ is the approximated LOOE. Moreover, as a reference, we compute the negative log-likelihood of the full solution as:

$$\epsilon = \frac{1}{M}\sum_{\mu=1}^M q_\mu\left(\{\hat w_a\}_{a=1}^L\right), \qquad (46)$$

and call it the training error hereafter. The training error is expected to be a monotonically increasing function of $\tilde\lambda$, while the prediction error is supposed to be non-monotonic.

In all of the experiments, we used a single CPU of Intel(R) Xeon(R) E5-2630 v3 2.4GHz. To solve the optimization problems in eqs. (4,6), we employed Glmnet (Friedman et al., 2010) which is implemented as a MEX subroutine in MATLAB®. The two approximations were implemented as raw codes in MATLAB. This is not the most optimized approach, because as seen in Algs. 1,2 our approximate formula uses a number of for and while loops which are slow in MATLAB, and hence the comparison is not necessarily fair. However, even in this comparison there is a significant difference in the computational time between the literal CV and our approximations, as shown below.

In Glmnet, the corresponding optimization problem is parameterized as follows:

$$\{\hat w_a(\tilde\lambda, \eta)\}_a = \operatorname*{argmin}_{\{w_a\}_a}\left\{\frac{1}{M}\sum_{\mu=1}^M q_\mu\left(\{w_a\}_{a=1}^L\right) + \tilde\lambda\left(\eta\sum_{a=1}^L \|w_a\|_1 + \frac{1-\eta}{2}\sum_{a=1}^L \|w_a\|_2^2\right)\right\}. \qquad (47)$$

In the following experiments, we present the results based on this parameterization. We basically prefer $\eta = 1$, for which the ℓ_2 term is absent, because the main contribution of the present paper is to overcome the technical difficulties stemming from the ℓ_1 term. However, Glmnet, or the algorithm it employs, occasionally loses its stability in some uncontrolled manner without the ℓ_2 term. Hence, in the following experiments we adaptively choose the value of $\eta$.[1]

[1] When employing our distributed codes implementing the approximate formula (Obuchi, 2017; Takahashi and Obuchi, 2017) in conjunction with Glmnet, the parameters $\lambda_1$ and $\lambda_2$ of eq. (30) are read as $\lambda_1 = M\tilde\lambda\eta$ and $\lambda_2 = M\tilde\lambda(1-\eta)$.

A sensitive point which should be noted is the convergence of the algorithm solving the present optimization problem. In Glmnet, a specialized version of the coordinate descent method is employed, and it requires a threshold to judge the algorithm convergence. Unless explicitly mentioned, we set this threshold tighter than the default value; this is necessary since we treat problems of rather large sizes. A looser choice of the threshold rather strongly affects the literal CV result, while it does not change the full solution or the training error as much. As a result, our approximations, which employ only the full solution, are rather robust against the choice of the threshold compared to the literal CV. This is also demonstrated below.

3.1 On simulated dataset

Let us start by testing with the simulated data. Suppose each “true” feature vector $w^0_a$ is independently identically drawn (i.i.d.) from the following Bernoulli-Gaussian prior:

$$w^0_a \sim \prod_{i=1}^N \left\{ (1-\rho_0)\,\delta(w^0_{ai}) + \rho_0\, \mathcal{N}(0, 1/\rho_0) \right\}, \qquad (48)$$

where $\mathcal{N}(m, \sigma^2)$ denotes a Gaussian distribution whose mean and variance are $m$ and $\sigma^2$, respectively. The resultant vector becomes $\rho_0$-sparse and its squared norm becomes $N$ on average. Then, we choose a class $y_\mu$ uniformly at random from $\{1, \dots, L\}$, and generate an observed feature vector $x_\mu$ by leveraging the following linear process:

$$x_\mu = \frac{w^0_{y_\mu}}{\sqrt{N}} + \xi_\mu, \qquad (49)$$

where $\xi_\mu$ is an observation noise, each component of which is i.i.d. from a Gaussian $\mathcal{N}(0, \sigma_\xi^2)$.

For convenience, we introduce the ratio of the data size to the feature dimensionality, $\alpha = M/N$; we then have five parameters, $(N, \alpha, L, \rho_0, \sigma_\xi^2)$, characterizing the experimental setup. It is rather heavy to examine the dependence on all of these parameters, and hence we mainly focus below on the dependence on $N$, $L$, and $\sigma_\xi^2$; the other parameters are kept fixed.
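The generative process of eqs. (48) and (49) can be sketched as follows; the function name and seed handling are ours:

```python
import numpy as np

def generate_data(N, L, M, rho0, sigma, seed=0):
    """Simulated data of sec. 3.1: Bernoulli-Gaussian class vectors (eq. 48)
    and noisy observations (eq. 49). sigma is the noise standard deviation."""
    rng = np.random.default_rng(seed)
    mask = rng.random((L, N)) < rho0                   # Bernoulli part
    W0 = mask * rng.normal(0.0, np.sqrt(1.0 / rho0), size=(L, N))
    y = rng.integers(0, L, size=M)                     # uniform class labels
    X = W0[y] / np.sqrt(N) + rng.normal(0.0, sigma, size=(M, N))
    return X, y, W0
```

The variance $1/\rho_0$ of the nonzero entries makes the expected squared norm of each $w^0_a$ equal to $N$, as stated above.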

3.1.1 Result

Let us summarize the result on simulated data.

Fig. 1 shows the plots of the prediction and training errors against $\tilde\lambda$ for several values of $L$, at fixed $\alpha$ and $\sigma_\xi^2$.

This demonstrates that both approximations provide results consistent with the literal LOO CV, except at small $\tilde\lambda$s. This inconsistency at small $\tilde\lambda$s is considered to be due to a numerical instability occurring in the literal CV. Actually, for small $\tilde\lambda$s, we have observed that certain small changes in the data induce large differences in the literal CV result. This example demonstrates that our approximations provide robust curves even in such situations. Note that as $L$ grows the number of estimated parameters increases while the data size is fixed, meaning that the problem becomes more and more underdetermined with the growth of $L$. Hence, Fig. 1 demonstrates that the developed approximations work irrespectively of how underdetermined the problem is.

Fig. 2 exhibits the $\sigma_\xi^2$-dependence of the errors and the approximation results, at fixed $N$ and $\alpha$.

For the very weak noise case (left panel), the difference between the prediction and training errors is negligible and hence all four curves are indistinguishable. For the moderate (middle panel) and large (right panel) noise cases, the training curve is very different from the prediction ones. The approximation curves are again consistent with the literal LOO one.

Fig. 3 demonstrates how the approximation accuracy changes as the system size grows.

For small sizes, a discriminable difference exists between the results of the approximations and the literal LOO CV, as well as between the results of the two approximations. This is expected, because our derivation relies on the largeness of $M$ and $N$. For large systems, the difference among the two approximations and the literal CV is much smaller. Considering this example in conjunction with the middle panel of Fig. 1, we can recognize that our approximate formula becomes fairly precise for sufficiently large $N$ in this parameter set. The normalized error difference corresponding to Fig. 3 is shown in Fig. 4.

We can observe that the difference tends to become smaller as the system size increases, which is expected because the perturbation employed in our approximate formula is justified in the large-$N$, large-$M$ limit.

Finally, let us consider the actual computational time needed to evaluate the literal CV estimator and the approximate LOOEs, and observe its system size dependence. The left panel of Fig. 5 provides the plot of the actual computational time against the system size. Here, the number of examined points of $\tilde\lambda$ used to obtain a solution path differs from size to size, and hence the plotted time is given as the whole computational time to obtain the solution path divided by the number of $\tilde\lambda$ points.

The left panel of Fig. 5 clearly displays the advantage and disadvantage of the developed approximations. For small sizes, the computational time for the optimization to obtain the full solution is shorter than the time to compute the approximate LOOEs, and hence the literal CV is better. However, for larger systems the optimization cost increases rapidly, and beyond a moderate size the approximate CV is better. For