1 Introduction
Successful application of any machine learning (ML) algorithm heavily depends on the careful tuning of its hyperparameters. However, hyperparameter optimization (HPO) is a non-trivial problem and has been a topic of research for several decades. Most existing works broadly adopt one of the following two approaches: a) black-box optimization or b) direct optimization. The black-box optimization approach is agnostic of the underlying ML model and adopts an advanced search algorithm to select the hyperparameters that minimize the validation error. Popular examples include random search, grid search, Bayesian-optimization based approaches (Bergstra et al., 2013; Snoek et al., 2012) and other advanced exploration techniques like (Li et al., 2017). The advantage is that it is applicable to any machine learning problem. However, it does not utilize the structure of the underlying ML model, which can be exploited for faster solutions. The direct optimization approach utilizes the underlying ML algorithm's structure by parameterizing the validation error as a function of the hyperparameters and directly optimizing it. Typical examples include bilevel hyperparameter optimization (Lorraine and Duvenaud, 2018; MacKay et al., 2019; Pedregosa, 2016; Franceschi et al., 2017; Maclaurin et al., 2015; Franceschi et al., 2018; Mehra and Hamm, 2019) and bound-based analytic model selection (Chapelle et al., 2002; Dhar et al., 2019; Cortes et al., 2017). Such approaches can provide significant computational improvements for HPO. However, the limitations of these approaches include a) the gradients of the validation error w.r.t. the hyperparameters must exist and b) instability of the optimization problem under ill-conditioned settings. In this paper we address this unstable convergence behavior of bilevel HPO approaches for ill-conditioned problems and propose a new framework to improve it. The main contributions of this work are,
- We propose a new algorithm, called Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO), to stabilize bilevel HPO for ill-conditioned problems.
- We provide theoretical analysis on the correctness of the proposed MY-HPO algorithm and also provide some initial convergence analysis.
- Finally, we provide extensive empirical results in support of our proposed approach.
The paper is structured as follows. Section 2 introduces the bilevel formulation for HPO and discusses the existing issues of state-of-the-art bilevel HPO solvers (Lorraine and Duvenaud, 2018; MacKay et al., 2019) under ill-conditioned settings. Section 3 introduces the Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO) algorithm as a solution to stabilize bilevel HPO in such settings. Additional theoretical analysis of the correctness of its solution and its convergence behavior is also provided. Empirical results and discussions are provided in Section 4.
2 Bi-Level Hyperparameter Optimization (HPO)
The standard bilevel formulation for hyperparameter optimization (HPO) is given as,

$$\lambda^{*} = \operatorname*{argmin}_{\lambda}\; \mathcal{L}_{V}\big(w^{*}(\lambda),\, \lambda\big) \quad \text{s.t.} \quad w^{*}(\lambda) = \operatorname*{argmin}_{w}\; \mathcal{L}_{T}(w, \lambda) \qquad (1)$$
Here, $\mathcal{L}_{V}$ denotes the validation loss, $\mathcal{L}_{T}$ the training loss, $\lambda$ the hyperparameters and $w$ the model parameters. A popular approach involves introducing a best-response function (Lorraine and Duvenaud, 2018; MacKay et al., 2019) to rather solve,

$$\min_{\lambda}\; \mathcal{L}_{V}\big(w^{*}(\lambda),\, \lambda\big) \qquad (2)$$
Here $w^{*}(\lambda)$ is the best-response function parameterized by the hyperparameters $\lambda$. For simplicity we limit our discussion to a single (scalar) hyperparameter; of course, it can be easily extended to multiple hyperparameters. The work on Stochastic Hyperparameter Optimization (SHO) (Lorraine and Duvenaud, 2018) uses a hypernetwork $w_{\phi}(\lambda)$ to model the best-response function, and adopts an alternating minimization of the training loss w.r.t. $\phi$ (the hypernetwork parameters) and of the validation loss w.r.t. $\lambda$ (the hyperparameters) to solve the HPO problem. Their proposed algorithm is provided in Appendix B. MacKay et al. (2019) adopt a similar alternating-gradient approach but modify the algorithm's hypothesis class (by scaling and shifting the network hidden layers) and add additional stability by adaptively tuning the degree of stochasticity (i.e., the perturbation scale) for the gradient updates. In this work we adopt the SHO algorithm (Algorithm 2) as a representative of such alternating-gradient approaches for solving bilevel HPO.
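To make the alternating scheme concrete, here is a minimal sketch of an SHO-style loop on a toy problem. The L2-regularized least-squares losses, the affine hypernetwork $w_{\phi}(\lambda) = \phi_0 + \lambda\,\phi_1$ and all step sizes are illustrative assumptions, not the exact configuration of Lorraine and Duvenaud (2018).

```python
import numpy as np

# Toy data: ridge regression with a single L2-penalty hyperparameter lam (an assumed setup).
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
Xtr = rng.normal(size=(40, 5)); ytr = Xtr @ w_true + 0.5 * rng.normal(size=40)
Xval = rng.normal(size=(20, 5)); yval = Xval @ w_true + 0.5 * rng.normal(size=20)

def val_loss(w):
    return 0.5 * np.mean((Xval @ w - yval) ** 2)

# Affine hypernetwork w_phi(lam) = phi0 + lam * phi1 (illustrative hypothesis class).
phi0, phi1 = np.zeros(5), np.zeros(5)
lam, alpha_w, alpha_lam = 0.1, 1e-2, 1e-3

for _ in range(2000):
    # (a) gradient step on the training loss w.r.t. the hypernetwork parameters phi
    w = phi0 + lam * phi1
    g_tr = Xtr.T @ (Xtr @ w - ytr) / len(ytr) + lam * w   # dL_T/dw
    phi0 -= alpha_w * g_tr                                # dw/dphi0 = I
    phi1 -= alpha_w * g_tr * lam                          # dw/dphi1 = lam * I
    # (b) gradient step on the validation loss w.r.t. the hyperparameter lam
    w = phi0 + lam * phi1
    g_val = Xval.T @ (Xval @ w - yval) / len(yval)        # dL_V/dw
    lam = max(lam - alpha_lam * (g_val @ phi1), 0.0)      # chain rule: dw/dlam = phi1

print(lam, val_loss(phi0 + lam * phi1))
```

The key point is that the hyperparameter update differentiates the validation loss through the (approximate) best-response, so no inner optimization to convergence is needed.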
One major limitation of these alternating-gradient based bilevel HPO algorithms is that their convergence behavior heavily depends on the conditioning of the problem. Basically, for ill-conditioned settings, the step size used for the gradient updates needs to be sufficiently small to ensure stability of such algorithms. This in turn leads to poor convergence rates (see also our results in Section 4). To alleviate this instability and ensure improved convergence we introduce our Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO) algorithm. For simplicity we will assume unique solutions for (2) in the rest of the paper.
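To see why the usable step size is tied to conditioning, the toy sketch below (our own illustration, not one of the experiments in this paper) contrasts a plain gradient step with a gradient step on the Moreau-Yosida envelope, which is equivalent to a proximal step, on a badly conditioned quadratic.

```python
import numpy as np

# Ill-conditioned quadratic f(w) = 0.5 * w^T H w with condition number 1e4.
H = np.diag([1.0, 1.0e4])

def gd_step(w, eta):
    return w - eta * (H @ w)                  # plain gradient step

def my_step(w, gamma):
    # A gradient step (with step gamma) on the Moreau-Yosida envelope of f equals the
    # proximal step: argmin_z f(z) + ||z - w||^2 / (2*gamma) = (I + gamma*H)^{-1} w.
    return np.linalg.solve(np.eye(2) + gamma * H, w)

w_gd = w_my = np.array([1.0, 1.0])
for _ in range(50):
    w_gd = gd_step(w_gd, eta=1.0e-3)          # above the 2/L = 2e-4 stability limit: diverges
    w_my = my_step(w_my, gamma=1.0e-3)        # same scale, but remains stable

print(np.linalg.norm(w_gd), np.linalg.norm(w_my))   # huge vs. below 1
```

The proximal (MY-regularized) step remains stable for any positive gamma, which is exactly the behavior MY-HPO exploits.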
3 Moreau-Yosida Regularized HPO
First we reformulate the problem in (2): given the oracle values introduced below, we solve,
(3)  
s.t. 
Here, we assume that and are provided to us by an oracle, satisfying at the solution . Proposition 3 justifies solving (3) in lieu of (2).
The problem (3) is a sum of two functions parameterized by different arguments, connected through an equality constraint. Such formulations are frequently seen in machine learning problems and are popularly solved through variants of the Alternating Direction Method of Multipliers (Boyd et al., 2011; Goldstein et al., 2014), the Alternating Minimization Algorithm (Tseng, 1991) or Douglas-Rachford splitting (Eckstein and Bertsekas, 1992). However, a major difference in (3) is that the oracle solutions are not available a priori. Hence, we cannot directly apply these existing approaches to solve (3). This leads to our new algorithm, called Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO), to solve (3). At each iteration we take the following steps,
Step 1. Update  (4)
Step 2. Update  (5)
Step 3. Update  (6)
Step 4. Update consensus  (7)
The complete algorithm is provided in Appendix C (Algorithm 3). Step 1 ensures a unique solution for the model parameters given the hyperparameters. In Steps 2 and 3, rather than taking gradient updates of the original losses (as in SHO), we take gradients of their Moreau-Yosida (MY) regularized counterparts. The Moreau-Yosida (MY) regularization of a function $f$ is defined as $M_{\gamma f}(x) = \min_{z}\, f(z) + \tfrac{1}{2\gamma}\|z - x\|^{2}$ and serves as a smooth approximation of $f$. Note that (5) and (6) amount to gradient updates of the MY-regularized validation and training losses in their scaled form (Boyd et al., 2011). This lends better stability to the updates in Steps 2-3, which is highly desirable for ill-conditioned problems. Another aspect of this algorithm is that the two updates are no longer agnostic of each other: the additional proximal terms penalize large steps in a direction detrimental to either of the loss functions. In essence, these terms maintain consensus between the updates while minimizing the validation and training losses separately. Any deviation from this consensus is captured by the consensus update in Step 4 and fed back in the next iteration. The user-defined parameter $\gamma$ controls the scale of these augmented terms and hence the convergence of the algorithm. The proposed algorithm has the following convergence properties.

(Convergence Criteria) The necessary and sufficient conditions for Steps 1-4 to solve (3) are, and . Further, we make the following claim on the algorithm's convergence to its stationary points.
Claim 1
(Convergence Guarantee) Under the assumptions that , are proper, closed and convex, with strongly convex and bounded , Steps 1-4 converge to a stationary point. Further, at the stationary point we have and .
All proofs are provided in the Appendix. To maintain the same per-step iteration cost as the SHO algorithm we simplify Steps 1-4 further and instead take single gradient updates in Steps 1-3. This results in the simplified MY-HPO Algorithm 1, which is used throughout our experiments. For simplicity we refer to this simplified Algorithm 1 as MY-HPO throughout the paper. Additionally, to ensure a descent direction we also add backtracking (BT) of the step size, scaled by a constant factor (see Chapter 4 in Beck, 2014). We refer to the algorithm with backtracking as MY-HPO (BT) and to the one with constant step size as MY-HPO (C).
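To make the two ingredients of this section concrete, the sketch below (a standalone illustration, not the MY-HPO updates themselves) numerically evaluates the MY envelope of a non-smooth function, checks the standard identity $\nabla M_{\gamma f}(x) = (x - \mathrm{prox}_{\gamma f}(x))/\gamma$, and applies a backtracking rule to the step size; the shrink factor 0.5 and the sufficient-decrease constant are assumed values.

```python
import numpy as np

def my_envelope_and_prox(f, x, gamma, grid):
    # Moreau-Yosida envelope M_{gamma f}(x) = min_z f(z) + (z - x)^2 / (2*gamma),
    # evaluated by brute force on a fine grid (illustration only).
    vals = f(grid) + (grid - x) ** 2 / (2.0 * gamma)
    i = np.argmin(vals)
    return vals[i], grid[i]                     # envelope value and prox point

f = lambda z: np.abs(z)                         # a non-smooth function
gamma, x = 0.5, 2.0
grid = np.linspace(-5.0, 5.0, 100001)

M, prox = my_envelope_and_prox(f, x, gamma, grid)
grad_M = (x - prox) / gamma                     # gradient of the (smooth) envelope
assert np.isclose(x - gamma * grad_M, prox)     # a step of size gamma is exactly a prox step

# Backtracking (cf. Beck, 2014, Ch. 4): shrink the step until sufficient decrease holds.
def M_of(point):
    return my_envelope_and_prox(f, point, gamma, grid)[0]

eta = 4.0
while M_of(x - eta * grad_M) > M - 0.5 * eta * grad_M ** 2:
    eta *= 0.5
print(M, grad_M, eta)                           # envelope value, its gradient, accepted step
```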
4 Results and Summary
4.1 Experimental Settings
We provide analysis for both regression and classification problems using the following loss functions,

- Least squares: ,
- Logistic: ,
Here, (regression) or (classification), = dimension of the problem, = training data, = number of training samples, = validation data, and = number of validation samples. The experimental details are provided in Table 1. For regression, we use the MNIST data to regress on the digit labels '0'-'9' (Park et al., 2020) and the Cookie data to predict the percentage of 'fat' relative to the 'flour' content using near-infrared region (NIR) values (Osborne et al., 1984). For classification, we classify between the digits '0' vs. '1' for MNIST, and the traffic signs '30' vs. '80' for GTSRB. An illustrative instantiation of these losses is sketched below.
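As a concrete instantiation of the two losses above, the sketch below uses an L2-regularization weight lam as the single hyperparameter; the regularization weight and its placement are assumptions made for illustration, not necessarily the exact forms used in our experiments.

```python
import numpy as np

def least_squares_loss(w, X, y, lam=0.0):
    # Mean squared error plus an (assumed) L2 penalty weighted by the hyperparameter lam.
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)

def logistic_loss(w, X, y, lam=0.0):
    # Binary logistic loss for labels y in {0, 1}, plus the same (assumed) L2 penalty.
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * np.sum(w ** 2)

# Example: evaluate both losses on random data.
rng = np.random.default_rng(0)
X, w = rng.normal(size=(30, 4)), rng.normal(size=4)
print(least_squares_loss(w, X, rng.normal(size=30), lam=0.1))
print(logistic_loss(w, X, rng.integers(0, 2, size=30), lam=0.1))
```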
4.2 Results
Table 2 provides the average (standard deviation) of the loss values on the training, validation and a separate test set, over 10 runs of the experiments. Here, for each experiment we partition the data sets in the same proportions as shown in Table 1. For regression we scale the loss by the variance of the y values. This is a standard normalization which reports the proportion of variance left unexplained by the estimated model. For the SHO and MY-HPO algorithms we only report the performance of the models using the best-performing step sizes. A detailed ablation study using different step sizes for the algorithms is provided in Appendix D. In addition, we also provide results for popular black-box algorithms publicly available through the Auptimizer toolbox (Liu et al., 2019). For all the algorithms we maintain a fixed budget of gradient computations. Additional details on the various algorithms' parameter settings are provided in Appendix D. Table 2 shows that for both the classification and regression problems the MY-HPO algorithm significantly outperforms all the baseline algorithms. The dominant convergence behavior of the bilevel formulations (i.e., MY-HPO and SHO) compared to black-box approaches is well known from previous studies (Lorraine and Duvenaud, 2018; MacKay et al., 2019), and is also seen in our results. Of course, increasing the number of gradient computations allows the black-box approaches to achieve similar loss values (see Appendix D). However, a non-trivial observation is that MY-HPO significantly outperforms the SHO algorithm. To further analyze this improved convergence behavior of the MY-HPO algorithm, we provide the convergence curves of these algorithms for the logistic loss on the GTSRB data in Fig. 1. Here, MY-HPO (BT) involves backtracking to ensure a descent direction, whereas MY-HPO (C) uses constant-step updates. Fig. 1 shows that the SHO algorithm obtains its best results with the step size fixed to ; increasing it further destabilizes the updates and results in a suboptimal SHO solution. On the other hand, using the Moreau-Yosida regularized updates we can accommodate larger step sizes and hence achieve better convergence rates. Additional improvement can be expected from backtracking, as it ensures descent directions and adds to the stability of the algorithm. This behavior is seen consistently for both the classification and regression problems on all the data sets (see Appendix D).
4.3 Summary
In summary, our results confirm the effectiveness of Moreau-Yosida (MY) regularized updates for bilevel HPO under (ill-conditioned) limited-data settings. Rather than taking alternating gradient updates as in SHO (Lorraine and Duvenaud, 2018) or STN (MacKay et al., 2019), we propose a modified algorithm, MY-HPO, that takes gradient updates on the Moreau-Yosida envelope and maintains a consensus variable. The proposed MY-HPO algorithm provides added stability and enables the use of larger step sizes, which in turn leads to better convergence rates. Under a fixed computation budget MY-HPO significantly outperforms the SHO algorithm (a popular representative of alternating-gradient based bilevel HPO solvers) as well as the widely used black-box HPO routines. Owing to space constraints, details regarding the algorithm parameters needed to reproduce the experimental results are provided in Appendix D.
Table 2.
Data          SHO    MY-HPO    Random    Grid    HyperOpt    Spearmint
Regression Problems
Cookie (No. of gradient computations = 5000)
  Train ( )
  Val. ( )
  Test ( )
MNIST (No. of gradient computations = 6000)
  Train ( )
  Val. ( )
  Test ( )
Classification Problems
MNIST (No. of gradient computations = 1000)
  Train ( )
  Val. ( )
  Test ( )
GTSRB (No. of gradient computations = 1000)
  Train ( )
  Val. ( )
  Test ( )
References

Beck (2014). Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB. Vol. 19, SIAM.
Bergstra et al. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures.
Boyd et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1), pp. 1-122.
Chapelle et al. (2002). Choosing multiple parameters for support vector machines. Machine Learning 46(1-3), pp. 131-159.
Cortes et al. (2017). AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 874-883.
Dhar et al. (2019). Multiclass learning from contradictions. In Advances in Neural Information Processing Systems, pp. 8400-8410.
Eckstein and Bertsekas (1992). On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming 55(1-3), pp. 293-318.
Franceschi et al. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1165-1173.
Franceschi et al. (2018). Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910.
Giselsson and Boyd (2016). Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Transactions on Automatic Control 62(2), pp. 532-544.
Goldstein et al. (2014). Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences 7(3), pp. 1588-1623.
LeCun and Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Li et al. (2017). Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18(1), pp. 6765-6816.
Liu et al. (2019). Auptimizer: an extensible, open-source framework for hyperparameter tuning. arXiv preprint arXiv:1911.02522.
Lorraine and Duvenaud (2018). Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419.
MacKay et al. (2019). Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. In International Conference on Learning Representations.
Maclaurin et al. (2015). Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113-2122.
Mehra and Hamm (2019). Penalty method for inversion-free deep bilevel optimization. arXiv preprint arXiv:1911.03432.
Osborne et al. (1984). Application of near infrared reflectance spectroscopy to the compositional analysis of biscuits and biscuit doughs. Journal of the Science of Food and Agriculture 35(1), pp. 99-105.
Park et al. (2020). Variable metric proximal gradient method with diagonal Barzilai-Borwein stepsize. In ICASSP 2020, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3597-3601.
Pedregosa (2016). Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML).
Snoek et al. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951-2959.
Stallkamp et al. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks.
Tseng (1991). Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM Journal on Control and Optimization 29(1), pp. 119-138.
Appendix A Proofs
A.1 Proof of Proposition 3
The proof follows from analyzing the KKT equations. Note that the Lagrangian of the problem in (2) is,
(8) 
KKT system,
(9)  
(10)  
(11)  
(12) 
At the solution of the KKT system, for a bijective mapping, we have from eq. (12) . This in turn indicates . Hence, at the solution, and .
A.2 Proof of Proposition 3
Assume the conditions and are satisfied at iteration . Then the steps give,
(13)  
(14)  
(15)  
(16) 
Under these conditions the updates in eqs. (13)-(16) become,
(17)  
In addition at iteration we also have from eq. (4) a unique solution for
Also we have from our assumptions,
(18)  
(19) 
Next for the necessary part if , the primary constraint in (3) is not satisfied. Hence is a necessary condition. To establish the necessary condition , consider and .
Now, (from above assumption). Finally, Step 1 of Algorithm 3 ensures, . Hence .
A.3 Proof of Claim 1
For the first part of the proof, observe that the steps are exactly the ADMM updates for a given . This allows us to re-utilize Proposition 4 in (Giselsson and Boyd, 2016) and claim that the operator equivalent to the steps is contractive. The exact form of this new operator and the equivalent convergence rates will be analyzed in a longer version of this work.
For the second part observe that the state of the system is determined by the variables . At stationary points the states remain the same. This gives,
From (16), 
Appendix B Stochastic Hyperparameter Optimization using Hypernetworks (SHO)
There are several versions (global vs. local) of the SHO algorithm introduced in (Lorraine and Duvenaud, 2018). The representative local version of the SHO algorithm is provided below in Algorithm 2.
- : learning rate of the training-loss gradient update w.r.t. the model parameters.
- : learning rate of the validation-loss gradient update w.r.t. the hyperparameters.
Appendix C Moreau-Yosida Regularized (MY-)HPO Algorithm
Step 4. Update consensus 
Connection with ADMM: The MY-HPO updates are closely connected to the Alternating Direction Method of Multipliers (ADMM) algorithm (Boyd et al., 2011). In fact, for a given , the steps are exactly the ADMM updates. However, the algorithm is fundamentally different from ADMM. For example, unlike ADMM, interchanging the steps completely changes the stationary points. Moreover, the updates do not transform into the Douglas-Rachford splitting operator, as traditionally seen for ADMM (Giselsson and Boyd, 2016). Still, the similarities of MY-HPO with ADMM enable us to modify previous convergence analyses in (Giselsson and Boyd, 2016) to obtain Proposition 3 and Claim 1.
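For readers less familiar with ADMM, below is a toy sketch (our own illustration on two scalar quadratics, not the MY-HPO updates) of the scaled-form ADMM iterations that the steps reduce to once the coupling quantities are held fixed.

```python
import numpy as np

# Toy consensus problem: minimize f(x) + g(z) subject to x = z,
# with f(x) = 0.5*a*(x - p)^2 and g(z) = 0.5*b*(z - q)^2.
a, p = 4.0, 1.0
b, q = 1.0, -2.0
rho = 1.0                                      # penalty / MY-regularization scale

x = z = u = 0.0                                # u is the scaled dual (consensus) variable
for _ in range(100):
    x = (a * p + rho * (z - u)) / (a + rho)    # x-update: prox of f at (z - u)
    z = (b * q + rho * (x + u)) / (b + rho)    # z-update: prox of g at (x + u)
    u = u + x - z                              # consensus (scaled dual) update

print(x, z, (a * p + b * q) / (a + b))         # both converge to the consensus minimizer 0.4
```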
Relaxation to the simplified MY-HPO Algorithm 1: Algorithm 1 simplifies the above Algorithm 3 by taking one gradient update rather than solving the minimization problems exactly. This reduces the per-step iteration cost of Algorithm 1 to be the same as that of the SHO algorithm in Algorithm 2: both Algorithms 1 and 2 incur two gradient steps per outer iteration. This simplification, however, deteriorates the convergence rate of the algorithm. We provide a convergence comparison between Algorithms 1 and 3 in Table 3. A toy illustration of this relaxation is sketched below.
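Here is the same toy consensus problem as above, but with each exact prox (argmin) replaced by a single gradient step on the corresponding augmented objective. This mirrors the spirit of the relaxation (cheaper iterations, slower overall convergence); the step sizes are assumed values and this is not the exact Algorithm 1.

```python
import numpy as np

a, p = 4.0, 1.0                  # f(x) = 0.5*a*(x - p)^2
b, q = 1.0, -2.0                 # g(z) = 0.5*b*(z - q)^2
rho, eta = 1.0, 0.2              # penalty scale and (assumed) gradient step size

x = z = u = 0.0
for _ in range(500):
    x -= eta * (a * (x - p) + rho * (x - z + u))   # one gradient step instead of the argmin
    z -= eta * (b * (z - q) + rho * (z - x - u))
    u += x - z                                     # consensus update unchanged

# Still reaches the consensus minimizer 0.4, just with more iterations than the exact-prox version.
print(x, z, (a * p + b * q) / (a + b))
```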
Appendix D Algorithm Parameters and Additional Results
For all the black-box approaches we match the number of outer iterations of the bilevel formulations, i.e., to train the model for any given hyperparameter value we use gradient updates with a fixed step size. This step size is set specifically for each problem reported below, and helps us achieve a training-gradient budget similar to that of the bilevel formulations. In addition, we set the following parameters in the Auptimizer toolbox for each of the following algorithms (a minimal sketch of this budget-matching setup for random search is given after the list below).
Random Search: We select the same as that used to generate the data. Additionally, we search in the range . We report the performance of the algorithm for a varying number of search parameters .
Grid Search: We select the range of search as . We report the performance of the algorithm for a varying number of search parameters .
HyperOpt: We select the same as that used to generate the data. We select the range of search as , and the engine = 'tpe'. We report the algorithm's performance for varying values of .
Spearmint: We select the range of search as , engine = 'GPEIOptChooser' and . We report the algorithm's performance for varying values of .
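Below is a minimal sketch of what the budget matching looks like for the simplest baseline, random search. The toy data, the inner training loop, the search range and the trial count are placeholders, not the actual Auptimizer configuration used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
Xtr = rng.normal(size=(40, 5)); ytr = Xtr @ w_true + 0.5 * rng.normal(size=40)
Xval = rng.normal(size=(20, 5)); yval = Xval @ w_true + 0.5 * rng.normal(size=20)

def train_model(lam, n_inner_steps=100, step_size=0.05):
    # Placeholder inner loop: gradient descent on an L2-regularized least-squares loss.
    w = np.zeros(Xtr.shape[1])
    for _ in range(n_inner_steps):
        w -= step_size * (Xtr.T @ (Xtr @ w - ytr) / len(ytr) + lam * w)
    return w

def val_loss(w):
    return 0.5 * np.mean((Xval @ w - yval) ** 2)

# Fixed budget: n_trials * n_inner_steps gradient computations in total,
# matched to the number of gradient steps allotted to the bilevel solvers.
n_trials, best_lam, best_val = 10, None, np.inf
for _ in range(n_trials):
    lam = 10.0 ** rng.uniform(-4, 1)          # sample lam on a log scale (assumed range)
    v = val_loss(train_model(lam))
    if v < best_val:
        best_lam, best_val = lam, v
print(best_lam, best_val)
```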
The parameters specific to the problems and the respective datasets are provided below.
D.1 Regression using Cookie Data
D.1.1 Experiment Parameters
Stochastic Hypernetwork Optimization (SHO): We keep the following parameters fixed (at their default values): , noise variance . Changing these values in our experiments did not result in significant improvements. We report the performance of SHO for varying values of . The code is publicly available at: https://github.com/lorraine2/hypernethypertraining
Moreau-Yosida Regularized HPO (MY-HPO): We keep the following parameter fixed, , and report the results for varying . It is well known in the ADMM literature that the selection of the penalty parameter greatly impacts the convergence behavior. Such analyses will be explored in a longer version of the paper.
Further, we fix (i.e., MaxIters in Algorithm 3) , and the step size for training the model used by the black-box approaches as .
D.1.2 Additional Results
The complete set of results with all the parameter settings is provided in Table 4. As seen from the results, increasing the budget (number of gradient computations) to almost that of the bilevel counterparts allows the black-box approaches to achieve similar performance.
Table 4.
Method
SHO ( )
SHO ( )
SHO ( )
MY-HPO (C) ( )
MY-HPO (C) ( )
MY-HPO (BT) ( )
MY-HPO (BT) ( )
MY-HPO (BT) ( )
Random ( )
Random ( )
Grid ( )
Grid ( )
HyperOpt ( )
HyperOpt ( )
Spearmint ( )
Spearmint ( )
Figure 2 illustrates the convergence curves of the algorithms and the unstable behavior of SHO as the step size increases. Note that for larger step sizes the SHO algorithm fails to converge; as the step size gets smaller the algorithm converges, but demonstrates a very slow convergence rate. This is also seen in the other experiments used in this paper. On the contrary, MY-HPO can accommodate higher step sizes and achieves better convergence. Additionally, Fig. 2 also provides a comparison between the MY-HPO Algorithm 3 and the simplified Algorithm 1. As seen in the figure (and confirmed in Table 3), the simplified algorithm has a lower per-step iteration cost but incurs many more iterations overall to converge.
D.2 Regression using MNIST Data
D.2.1 Experiment Parameters
Stochastic Hypernetwork Optimization (SHO): We fix the parameters to the same values as in D.1.1. We vary the parameter as shown in Table 5.
Moreau-Yosida Regularized HPO (MY-HPO): We fix the parameters to the same values as in D.1.1. We vary the parameters as shown in Table 5.
Further, for this data set we fix (i.e., MaxIters in Algorithm 3) , and the step size for training the model used by the black-box approaches as .
D.2.2 Additional Results
Table 5.
Method
SHO ( )
SHO ( )
SHO ( )
MY-HPO (C) ( )
MY-HPO (C) ( )
MY-HPO (BT) ( )
MY-HPO (BT) ( )
MY-HPO (BT) ( )
Random ( )
Grid ( )
HyperOpt ( )
Spearmint ( )

Note that here further increasing the computation budget does not provide additional improvement for the black-box approaches. The results support similar conclusions,

- MY-HPO accommodates higher step sizes and results in better convergence.
- Backtracking further ensures a descent direction (in each iteration) and leads to better convergence.
- Increasing the budget (number of gradient computations) to almost that of the bilevel counterparts allows the black-box approaches to achieve similar performance.
D.3 Classification using MNIST Data
D.3.1 Experiment Parameters
Stochastic Hypernetwork Optimization (SHO): We fix the parameters to the same values as in D.1.1. We vary the parameter as shown in Table 6.
Moreau-Yosida Regularized HPO (MY-HPO): We fix the parameters to the same values as in D.1.1. We vary the parameters as shown in Table 6.
Further, for this data set we fix (i.e., MaxIters in Algorithm 3) , and the step size for training the model used by the black-box approaches as .
D.3.2 Additional Results
Table 6.
Method
SHO ( )
SHO ( )
SHO ( )
MY-HPO (C) ( )
MY-HPO (C) ( )
