Stabilizing Bi-Level Hyperparameter Optimization using Moreau-Yosida Regularization

07/27/2020, by Sauptik Dhar et al., LG Electronics Inc.

This research proposes to use the Moreau-Yosida envelope to stabilize the convergence behavior of bi-level hyperparameter optimization (HPO) solvers, and introduces a new algorithm called Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO). Theoretical analysis of the correctness of the MY-HPO solution and an initial convergence analysis are also provided. Our empirical results show significant improvement in loss values for a fixed computation budget, compared to state-of-the-art bi-level HPO solvers.


1 Introduction

Successful application of any machine learning (ML) algorithm depends heavily on careful tuning of its hyper-parameters. However, hyper-parameter optimization (HPO) is a non-trivial problem and has been a topic of research for several decades. Most existing works broadly adopt one of two approaches: a) black-box optimization or b) direct optimization. The black-box approach is agnostic of the underlying ML model and adopts an advanced search algorithm to select the hyper-parameters that minimize the validation error. Popular examples include random search, grid search, Bayesian optimization based approaches (Bergstra et al., 2013; Snoek et al., 2012), and other advanced exploration techniques such as (Li et al., 2017). The advantage is that it is applicable to any machine learning problem; however, it does not exploit the structure of the underlying ML model, which can enable faster solutions. Direct optimization utilizes the underlying ML algorithm's structure by parameterizing the validation error as a function of the hyper-parameters and optimizing it directly. Typical examples include bi-level hyper-parameter optimization (Lorraine and Duvenaud, 2018; MacKay et al., 2019; Pedregosa, 2016; Franceschi et al., 2017; Maclaurin et al., 2015; Franceschi et al., 2018; Mehra and Hamm, 2019) and bound-based analytic model selection (Chapelle et al., 2002; Dhar et al., 2019; Cortes et al., 2017). Such approaches can provide significant computational improvements for HPO. However, their limitations include: a) the gradients of the validation error w.r.t. the hyperparameters must exist, and b) instability of the optimization problem under ill-conditioned settings. In this paper we address the unstable convergence behavior of bi-level HPO approaches for ill-conditioned problems and propose a new framework to improve it. The main contributions of this work are:

  1. We propose a new algorithm called Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO) to stabilize bi-level HPO for ill-conditioned problems.

  2. We provide theoretical analysis of the correctness of the proposed MY-HPO algorithm along with an initial convergence analysis.

  3. Finally, we provide extensive empirical results in support of our proposed approach.

The paper is structured as follows. Section 2 introduces the bi-level formulation for HPO and discusses the issues that state-of-the-art bi-level HPO solvers (Lorraine and Duvenaud, 2018; MacKay et al., 2019) face under ill-conditioned settings. Section 3 introduces the Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO) algorithm as a solution to stabilize bi-level HPO in such settings, along with theoretical analysis of its solution's correctness and convergence behavior. Empirical results and discussions are provided in Section 4.

2 Bi-Level Hyperparameter Optimization (HPO)

The standard bi-level formulation for hyperparameter optimization (HPO) is given as,

λ* = argmin_λ L_V(w*(λ), λ)   s.t.   w*(λ) = argmin_w L_T(w, λ)    (1)

Here, L_V = validation loss, L_T = training loss, λ = hyperparameters, and w = model parameters. A popular approach involves introducing a best-response function (Lorraine and Duvenaud, 2018; MacKay et al., 2019) and instead solving,

λ* = argmin_λ L_V(ŵ(λ), λ),   where   ŵ(λ) = argmin_w L_T(w, λ)    (2)

Here ŵ(λ) is the best-response function parameterized by the hyperparameters λ. For simplicity we limit our discussion to a single (scalar) hyperparameter λ; the approach extends easily to multiple hyperparameters. The work on Stochastic Hyperparameter Optimization (SHO) (Lorraine and Duvenaud, 2018) uses a hypernetwork w_φ(λ) to model the best-response function and adopts an alternating minimization of the training loss w.r.t. φ (the hypernetwork parameters) and of the validation loss w.r.t. λ (the hyperparameters) to solve the HPO problem. Their proposed algorithm is provided in Appendix B. MacKay et al. (2019) adopt a similar alternating-gradient approach but modify the hypothesis class (by scaling and shifting the network hidden layers) and add further stability by adaptively tuning the degree of stochasticity (i.e., the perturbation scale) of the gradient updates. In this work we adopt the SHO algorithm (Algorithm 2 in Appendix B) as a representative of such alternating-gradient approaches for solving bi-level HPO.
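To make the alternating structure concrete, below is a minimal NumPy sketch of an SHO-style loop for ridge-regularized least squares, assuming a linear hypernetwork w_φ(λ) = Wλ + b, synthetic data, and hand-picked step sizes; these choices are ours for illustration and do not reproduce the exact setup of Lorraine and Duvenaud (2018).

```python
# Sketch of SHO-style alternating updates (illustrative assumptions: linear
# hypernetwork w(lam) = W*lam + b, ridge training loss, synthetic data).
import numpy as np

rng = np.random.default_rng(0)
d, n_tr, n_val = 20, 100, 50
X_tr, X_val = rng.standard_normal((n_tr, d)), rng.standard_normal((n_val, d))
w_true = rng.standard_normal(d)
y_tr = X_tr @ w_true + 0.1 * rng.standard_normal(n_tr)
y_val = X_val @ w_true + 0.1 * rng.standard_normal(n_val)

W, b = np.zeros(d), np.zeros(d)       # hypernetwork parameters (phi)
log_lam = 0.0                         # optimize log(lambda) to keep lambda > 0
alpha_phi, alpha_lam = 1e-3, 1e-3     # step sizes (illustrative)

for t in range(2000):
    lam = np.exp(log_lam)
    # training step: update hypernetwork parameters phi = (W, b)
    w = W * lam + b                                     # hypernetwork output w_phi(lam)
    g_w = X_tr.T @ (X_tr @ w - y_tr) / n_tr + lam * w   # grad of ridge training loss in w
    W -= alpha_phi * g_w * lam                          # chain rule: dw/dW = lam
    b -= alpha_phi * g_w                                # chain rule: dw/db = 1
    # validation step: update the hyperparameter lambda
    w = W * lam + b
    g_v = X_val.T @ (X_val @ w - y_val) / n_val         # grad of validation loss in w
    log_lam -= alpha_lam * (g_v @ W) * lam              # d L_val / d log(lambda)

print("selected lambda:", np.exp(log_lam))
```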

One major limitation of these alternating-gradient based bi-level HPO algorithms is that their convergence behavior depends heavily on the conditioning of the problem. For ill-conditioned settings, the step size used for the gradient updates needs to be sufficiently small to ensure stability, which in turn leads to poor convergence rates (also see our results in Section 4); the toy example below illustrates this trade-off. To alleviate this instability and ensure improved convergence we introduce our Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO) algorithm. For simplicity we assume unique solutions for (2) in the rest of the paper.
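The step-size sensitivity is easy to reproduce on a toy quadratic: gradient descent on f(w) = 0.5 wᵀ H w with H = diag(1, κ) diverges once the step size exceeds 2/κ, while step sizes small enough to be stable make slow progress along the low-curvature direction. A minimal illustration (the condition number κ = 10³ is an arbitrary choice of ours):

```python
# Gradient descent on an ill-conditioned quadratic f(w) = 0.5 * w' diag(1, kappa) w.
# Steps larger than 2/kappa diverge; stable steps make slow progress along the
# low-curvature direction. (Toy illustration only.)
import numpy as np

kappa = 1e3                      # condition number (arbitrary choice)
H = np.diag([1.0, kappa])
w0 = np.array([1.0, 1.0])

def run_gd(step, iters=200):
    w = w0.copy()
    for _ in range(iters):
        w = w - step * (H @ w)   # gradient of the quadratic is H @ w
    return 0.5 * w @ H @ w       # final objective value

for step in [1.9 / kappa, 2.1 / kappa]:
    print(f"step={step:.2e}  final f(w)={run_gd(step):.3e}")
```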

3 Moreau-Yosida regularized HPO

First we reformulate the problem in (2). Given the oracle quantities introduced below, we solve,

(3)
s.t.

Here, we assume that the required quantities are provided to us by an oracle and satisfy the constraint at the solution. Proposition 3 justifies solving (3) in lieu of (2).

For a bijective mapping , the stationary points of (3) are also stationary points of (2) with and .

The problem (3) is a sum of two functions of different arguments connected through an equality constraint. Such formulations arise frequently in machine learning and are popularly solved through variants of the Alternating Direction Method of Multipliers (Boyd et al., 2011; Goldstein et al., 2014), the Alternating Minimization Algorithm (Tseng, 1991), or Douglas-Rachford splitting (Eckstein and Bertsekas, 1992). However, a major difference in (3) is that the oracle solution is not available a priori; hence we cannot directly apply these existing approaches to solve (3). This leads to our new algorithm, called Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO), for solving (3). At each iteration we take the following steps.
Step 1. Update (4)
Step 2. Update (5)
Step 3. Update (6)
Step 4. Update consensus (7)

The complete algorithm is provided in Appendix C (Algorithm 3). Step 1 ensures a unique solution for the inner problem given the current hyperparameters, i.e. the best response. In Steps 2 and 3, rather than taking gradient updates of the training and validation losses directly (as in SHO), we take gradients of their Moreau-Yosida (MY) regularized counterparts. The Moreau-Yosida (MY) regularization of a function f with parameter ρ > 0 is defined as f_ρ(x) = min_z { f(z) + (ρ/2) ||x − z||² } and serves as a smooth approximation of f. Note that (5) and (6) correspond to gradient updates of the MY regularized losses in scaled form (Boyd et al., 2011). This lends better stability to the updates in Steps 2 and 3, which is highly desirable for ill-conditioned problems; the snippet below illustrates the smoothing effect on a simple non-smooth function.
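As a concrete illustration of this smoothing, the snippet below evaluates the MY envelope of the non-smooth function f(x) = |x| in closed form and reports its (always well-defined, bounded) gradient; both the choice of f and the penalty value ρ = 2 are our own, purely for illustration.

```python
# Moreau-Yosida envelope of f(x) = |x| with penalty rho, using the
# (rho/2) * ||x - z||^2 convention. The envelope is smooth even though f is
# not, and its gradient rho * (x - prox(x)) is a clipped (Huber-like) gradient.
# Both f and rho = 2.0 are illustrative choices.
import numpy as np

rho = 2.0

def prox_abs(x, rho):
    """Prox of |.| with parameter rho: soft-thresholding at 1/rho."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / rho, 0.0)

def my_envelope(x, rho):
    z = prox_abs(x, rho)
    return np.abs(z) + 0.5 * rho * (x - z) ** 2

def my_gradient(x, rho):
    return rho * (x - prox_abs(x, rho))   # always well defined, bounded by 1

for x in np.linspace(-2, 2, 9):
    print(f"x={x:+.2f}  f(x)={abs(x):.2f}  f_rho(x)={my_envelope(x, rho):.3f}  "
          f"grad f_rho(x)={my_gradient(x, rho):+.2f}")
```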

Another aspect of this algorithm is that the two updates are no longer agnostic of each other. The augmented consensus terms constrain against large steps in a direction detrimental to either of the loss functions. In essence, these additional terms maintain consensus between the updates while the training and validation losses are minimized separately with respect to their own arguments. Any deviation from this consensus is captured in the consensus update (Step 4) and fed back in the next iteration. The user-defined penalty parameter ρ controls the scale of these augmented terms and hence the convergence of the algorithm. The proposed algorithm has the following convergence properties:

(Convergence Criteria) The necessary and sufficient conditions for Steps 1-4 to solve (3) are and . Further, we make the following claim on the algorithm's convergence to its stationary points.

Claim 1

(Convergence Guarantee) Under the assumptions that the losses are proper, closed and convex, with one strongly convex and one bounded, Steps 1-4 converge to a stationary point. Further, at the stationary point we have and .

All proofs are provided in the Appendix. To maintain the same per-step iteration cost as the SHO algorithm, we simplify Steps 1-4 further and take single gradient updates in Steps 1-3. This results in the simplified MY-HPO Algorithm 1, which is used throughout our experiments; for brevity we refer to this simplified Algorithm 1 as MY-HPO throughout the paper. Additionally, to ensure a descent direction, we also add backtracking (BT) of the step size, shrinking it by a constant factor (see Chapter 4 in (Beck, 2014)); a minimal sketch of such a rule is given below. We refer to the algorithm with backtracking as MY-HPO (BT) and to the constant-step-size variant as MY-HPO (C).
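A minimal sketch of the kind of backtracking rule we refer to (an Armijo-type sufficient-decrease test with step-size halving); the shrink factor and decrease constant below are illustrative defaults, not the exact settings used in our experiments.

```python
# Backtracking (Armijo) step-size selection for a gradient step on a smooth
# loss. Shrink factor and sufficient-decrease constant are illustrative.
import numpy as np

def backtracking_step(loss, grad, w, step0=1.0, shrink=0.5, c=1e-4, max_tries=30):
    """Return a step size that guarantees sufficient decrease of `loss` at w."""
    g = grad(w)
    f0 = loss(w)
    step = step0
    for _ in range(max_tries):
        if loss(w - step * g) <= f0 - c * step * (g @ g):
            return step                     # descent with sufficient decrease
        step *= shrink                      # otherwise shrink and retry
    return step

# Usage on a toy ill-conditioned quadratic:
H = np.diag([1.0, 1e3])
loss = lambda w: 0.5 * w @ H @ w
grad = lambda w: H @ w
w = np.array([1.0, 1.0])
for _ in range(50):
    step = backtracking_step(loss, grad, w)
    w = w - step * grad(w)
print("final loss:", loss(w))
```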

Input: MaxIters = total iterations; convergence error tolerance; step sizes for the gradient updates.
Output:
Initialize; while (iterations < MaxIters) and (error > tolerance) do
      
end while
Algorithm 1 Simplified MY-HPO Algorithm

4 Results and Summary

4.1 Experimental Settings

We provide analysis for both regression and classification problems using the following loss functions,

  • Least Square: L(w) = ||Xw - y||_2^2,

  • Logistic: L(w) = sum_i log(1 + exp(-y_i x_i^T w)),

Here, w ∈ R^d, y ∈ R for regression or y ∈ {-1, +1} for classification, d = dimension of the problem, (X_T, y_T) = training data with n_T training samples, and (X_V, y_V) = validation data with n_V validation samples. The experimental details are provided in Table 1. For regression, we use the MNIST data to regress on the digit labels '0'-'9' (Park et al., 2020) and the Cookie data to predict the percentage of 'fat' relative to the 'flour' content from near-infrared (NIR) measurements (Osborne et al., 1984). For classification, we classify between digits '0' vs. '1' for MNIST, and between the traffic signs '30' vs. '80' for GTSRB.
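For reference, a sketch of the two losses in NumPy; how the hyperparameter enters the training objective (here as an ℓ2 penalty weighted by λ) and the normalization by the sample count are our assumptions for illustration.

```python
# Least-squares and logistic losses used in the experiments (sketch). The
# lambda-weighted l2 penalty on the training objective and the normalization
# by n are assumptions made for this illustration.
import numpy as np

def least_squares_loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def logistic_loss(w, X, y):                      # labels y in {-1, +1}
    # logaddexp(0, t) = log(1 + exp(t)) computed in a numerically stable way
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def train_objective(w, X, y, lam, loss=least_squares_loss):
    return loss(w, X, y) + 0.5 * lam * np.sum(w ** 2)

def val_objective(w, X, y, loss=least_squares_loss):
    return loss(w, X, y)                         # validation loss is unregularized
```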

Data Sets | Training/Validation size | Test size | Dimension (d)
COOKIE (Osborne et al., 1984) | | | 699 (NIR)
MNIST Regression (LeCun, 1998) | (per digit) | (per digit) | (Pixel)
MNIST Classification (LeCun, 1998) | (per class) | (per class) | (Pixel)
GTSRB (Stallkamp et al., 2012) | (per class) | (per class) | (HOG)

Table 1: Datasets and Experimental Settings

4.2 Results

Table 2 provides the average (standard deviation) of the loss values on training, validation, and a separate test set over 10 runs of the experiments. For each experiment we partition the data sets in the same proportions as shown in Table 1. For regression we scale the output loss by the variance of the y values; this standard normalization reports the proportion of variance left unexplained by the estimated model. For the SHO and MY-HPO algorithms we only report the performance of the models using the best-performing step sizes; a detailed ablation study over different step sizes is provided in Appendix D. In addition, we also provide results for popular black-box algorithms publicly available through the Auptimizer toolbox (Liu et al., 2019). For all the algorithms we maintain a fixed budget of gradient computations. Additional details on the algorithms' parameter settings are provided in Appendix D.

Table 2 shows that for both the classification and regression problems the MY-HPO algorithm significantly outperforms all the baseline algorithms. The dominant convergence behavior of the bi-level formulations (i.e. MY-HPO and SHO) compared to black-box approaches is well known from previous studies (Lorraine and Duvenaud, 2018; MacKay et al., 2019), and is also seen in our results. Of course, increasing the number of gradient computations allows the black-box approaches to achieve similar loss values (see Appendix D). However, a non-trivial observation is that MY-HPO significantly outperforms the SHO algorithm. To further analyze this improved convergence behavior, we provide the convergence curves of these algorithms for the Logistic loss on the GTSRB data in Fig. 1. Here, MY-HPO (BT) uses backtracking to ensure a descent direction, whereas MY-HPO (C) uses constant-step updates. Fig. 1 shows that the SHO algorithm obtains its best results with a small fixed step size; increasing the step size further destabilizes the updates and results in a sub-optimal SHO solution. On the other hand, using the Moreau-Yosida regularized updates we can accommodate larger step sizes and hence achieve better convergence rates. Additional improvement can be expected from backtracking, as it ensures descent directions, which adds to the stability of the algorithm. This behavior is seen consistently for both the classification and regression problems on all data sets (see Appendix D).

4.3 Summary

In summary, our results confirm the effectiveness of Moreau-Yosida (MY) regularized updates for bi-level HPO under (ill-conditioned) limited data settings. Rather than taking alternating gradient updates as in SHO (Lorraine and Duvenaud, 2018) or STN (MacKay et al., 2019), the proposed MY-HPO algorithm takes gradient updates on the Moreau-Yosida envelope and maintains a consensus variable. This provides added stability and enables us to use larger step sizes, which in turn leads to better convergence rates. Under a fixed computation budget MY-HPO significantly outperforms the SHO algorithm (a popular representative of alternating-gradient based bi-level HPO solvers) as well as widely used black-box HPO routines. Owing to space constraints, details of the algorithm parameters required to reproduce the experimental results are provided in Appendix D.

Data | SHO | MyHPO (C) | MyHPO (BT) | Random | Grid | HyperOpt | Spearmint

Regression Problems
Cookie (No. of gradient computations = 5000): Train () / Val. () / Test ()
MNIST (No. of gradient computations = 6000): Train () / Val. () / Test ()

Classification Problems
MNIST (No. of gradient computations = 1000): Train () / Val. () / Test ()
GTSRB (No. of gradient computations = 1000): Train () / Val. () / Test ()

Table 2: Loss values of different HPO algorithms using a fixed computation budget.
Figure 1: Convergence behavior of SHO vs. MY-HPO for different step-sizes for GTSRB data using Logistic loss. For MY-HPO we show the values.

References

  • A. Beck (2014) Introduction to nonlinear optimization: theory, algorithms, and applications with MATLAB. Vol. 19, SIAM. Cited by: §3.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. Cited by: §1.
  • S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 (1), pp. 1–122. Cited by: Appendix C, §3, §3.
  • O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee (2002) Choosing multiple parameters for support vector machines. Machine Learning 46 (1-3), pp. 131–159. Cited by: §1.
  • C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang (2017) AdaNet: adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 874–883. Cited by: §1.
  • S. Dhar, V. Cherkassky, and M. Shah (2019) Multiclass learning from contradictions. In Advances in Neural Information Processing Systems, pp. 8400–8410. Cited by: §1.
  • J. Eckstein and D. P. Bertsekas (1992) On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming 55 (1-3), pp. 293–318. Cited by: §3.
  • L. Franceschi, M. Donini, P. Frasconi, and M. Pontil (2017) Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1165–1173. Cited by: §1.
  • L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910. Cited by: §1.
  • P. Giselsson and S. Boyd (2016) Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Transactions on Automatic Control 62 (2), pp. 532–544. Cited by: §A.3, Appendix C.
  • T. Goldstein, B. O’Donoghue, S. Setzer, and R. Baraniuk (2014) Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences 7 (3), pp. 1588–1623. Cited by: §3.
  • Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: Table 1.
  • L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18 (1), pp. 6765–6816. Cited by: §1.
  • J. Liu, S. Tripathi, U. Kurup, and M. Shah (2019) Auptimizer: an extensible, open-source framework for hyperparameter tuning. arXiv preprint arXiv:1911.02522. Cited by: §4.2.
  • J. Lorraine and D. Duvenaud (2018) Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419. Cited by: Appendix B, §1, §1, §2, §4.2, §4.3.
  • M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse (2019) Self-tuning networks: bilevel optimization of hyperparameters using structured best-response functions. In International Conference on Learning Representations. Cited by: §1, §1, §2, §4.2, §4.3.
  • D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122. Cited by: §1.
  • A. Mehra and J. Hamm (2019) Penalty method for inversion-free deep bilevel optimization. arXiv preprint arXiv:1911.03432. Cited by: §1.
  • B. G. Osborne, T. Fearn, A. R. Miller, and S. Douglas (1984) Application of near infrared reflectance spectroscopy to the compositional analysis of biscuits and biscuit doughs. Journal of the Science of Food and Agriculture 35 (1), pp. 99–105. Cited by: §4.1, Table 1.
  • Y. Park, S. Dhar, S. Boyd, and M. Shah (2020) Variable metric proximal gradient method with diagonal barzilai-borwein stepsize. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 3597–3601. Cited by: §4.1.
  • F. Pedregosa (2016) Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning, ICML. External Links: Link. Cited by: §1.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959. Cited by: §1.
  • J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks. External Links: ISSN 0893-6080, Document. Cited by: Table 1.
  • P. Tseng (1991) Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM Journal on Control and Optimization 29 (1), pp. 119–138. Cited by: §3.

Appendix A Proofs

A.1 Proof of Proposition 3

The proof follows from analyzing the KKT conditions. Note that the Lagrangian of problem (2) is,

(8)

KKT system,

(9)
(10)
(11)
(12)

At the solution of the KKT System for a bijective mapping we have eq. (12) . This in turn indicates, . Hence at solution, and

 

A.2 Proof of Proposition 3

Assume the conditions and are satisfied at iteration . Then the steps give,

(13)
(14)
(15)
(16)

Under these conditions the updates in eqs. (13)-(16) become,

(17)

In addition at iteration we also have from eq. (4) a unique solution for

Also we have from our assumptions,

(18)
(19)

Using eqs. (18) and (19) in (A.2) gives,

Next for the necessary part if , the primary constraint in (3) is not satisfied. Hence is a necessary condition. To establish the necessary condition , consider and .

Now, (from above assumption). Finally, Step 1 of Algorithm 3 ensures, . Hence .  

A.3 Proof for Claim 1

For the first part of the proof, observe that with the remaining quantities held fixed, the steps are exactly the ADMM updates. This allows us to re-utilize Proposition 4 in (Giselsson and Boyd, 2016) and claim that the operator equivalent to the steps is contractive. The exact form of this new operator and the corresponding convergence rates will be analyzed in a longer version of this work.

For the second part observe that the state of the system is determined by the variables . At stationary points the states remain the same. This gives,

From (16),

 

Appendix B Stochastic Hyperparameter Optimization using Hypernetworks (SHO)

There are several versions (global vs. local) of the SHO algorithm introduced in (Lorraine and Duvenaud, 2018). The representative local version of the SHO algorithm is provided below in Algorithm 2.

Input:
learning rate for the training-loss gradient update w.r.t. the model parameters;
learning rate for the validation-loss gradient update w.r.t. the hyperparameters.
Output:
Data: training data; validation data
Initialize and define;
while   do
       // typically modeled as a Normal distribution
       ;
      
end while
Algorithm 2 SHO Algorithm (Local)

Appendix C Moreau Yosida regularized (MY)-HPO Algorithm

Input: MaxIters = total iterations; convergence error tolerance.
Output:
Data: training data; validation data
Initialize; while (iterations < MaxIters) and (error > tolerance) do
       Step 1. Update
Step 2. Update
Step 3. Update
Step 4. Update consensus
end while
Algorithm 3 MY-HPO algorithm

Connection with ADMM: The MY-HPO updates are closely connected to the Alternating Direction Method of Multipliers (ADMM) algorithm (Boyd et al., 2011). In fact, with the remaining quantities held fixed, the steps are exactly the ADMM updates. However, the algorithm is fundamentally different from ADMM. For example, unlike ADMM, interchanging the steps completely changes the stationary points. Moreover, the updates do not transform into the Douglas-Rachford splitting operator as traditionally seen for ADMM (Giselsson and Boyd, 2016). Still, the similarities of MY-HPO with ADMM enable us to adapt the previous convergence analyses in (Giselsson and Boyd, 2016) to obtain Proposition 3 and Claim 1.
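For intuition, the sketch below runs generic consensus ADMM (scaled form) on a training and a validation quadratic with everything else held fixed; it illustrates the "separate minimizations plus consensus/dual update" structure that MY-HPO builds on, but it is not the MY-HPO update rule itself, since here the subproblems are solved exactly and no hyperparameter is updated.

```python
# Generic consensus ADMM (scaled form) on two quadratic losses, illustrating
# the "separate minimizations + consensus/dual update" structure. This is NOT
# the MY-HPO algorithm: no hyperparameter is optimized and the subproblems are
# solved exactly here.
import numpy as np

rng = np.random.default_rng(0)
d, n_tr, n_val = 10, 80, 40
X_tr, X_val = rng.standard_normal((n_tr, d)), rng.standard_normal((n_val, d))
w_star = rng.standard_normal(d)
y_tr = X_tr @ w_star + 0.1 * rng.standard_normal(n_tr)
y_val = X_val @ w_star + 0.1 * rng.standard_normal(n_val)

rho = 1.0                                      # penalty / consensus weight
w, u, s = np.zeros(d), np.zeros(d), np.zeros(d)  # s = scaled dual (consensus) variable
A_tr = X_tr.T @ X_tr + rho * np.eye(d)
A_val = X_val.T @ X_val + rho * np.eye(d)

for k in range(100):
    # w-update: minimize training loss + (rho/2)||w - u + s||^2 over w
    w = np.linalg.solve(A_tr, X_tr.T @ y_tr + rho * (u - s))
    # u-update: minimize validation loss + (rho/2)||w - u + s||^2 over u
    u = np.linalg.solve(A_val, X_val.T @ y_val + rho * (w + s))
    # dual/consensus update: captures any deviation w != u
    s = s + w - u

print("consensus gap ||w - u||:", np.linalg.norm(w - u))
```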

Relaxation to the simplified MY-HPO Algorithm 1: Algorithm 1 simplifies Algorithm 3 above by taking a single gradient update rather than solving the minimization problems exactly. This reduces the per-step iteration cost of Algorithm 1 to that of the SHO algorithm in Algorithm 2: both Algorithms 1 and 2 incur two gradient steps per outer iteration. This simplification, however, deteriorates the convergence rate of the algorithm. We provide a convergence comparison between Algorithms 1 and 3 in Table 3.

Data | Train Loss | Val. Loss | Test Loss | No. of Iterations () | MaxIters
Cookie
MNIST
Table 3: Performance of MY-HPO Algorithm 3 for the regression problems in Table 2

Appendix D Algorithm Parameters and Additional Results

For all the black-box approaches we match the number of outer iterations of the bi-level formulations, i.e. the number of iterations used to train the model for any given hyperparameter value. Further, we train the models using gradient updates with a fixed step size; this parameter is set specifically for each problem reported below and helps us match the training-gradient budget of the bi-level formulations. In addition, we set the following parameters using the Auptimizer toolbox for each of the following algorithms.

Random Search: We select the same as that used to generate the data. Additionally, we search in the range . We report the performance of the algorithm for a varying number of search parameters .

Grid Search: We select the range of search as . We report the performance of the algorithm for a varying number of search parameters .

HyperOpt: We select the same as that used to generate the data. We select the range of search as , and the engine = 'tpe'. We report the algorithm's performance for varying values of .

Spearmint: We select the range of search as , engine = 'GPEIOptChooser', and . We report the algorithm's performance for varying values of .
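For reference, a minimal TPE search written directly against the hyperopt library (rather than through the Auptimizer wrapper); the stand-in objective, search range, and evaluation budget are placeholders for illustration.

```python
# Minimal TPE search with the hyperopt library (used directly, not via the
# Auptimizer wrapper). Objective, search range and max_evals are placeholders.
import numpy as np
from hyperopt import fmin, tpe, hp

def objective(log_lam):
    """Stand-in for the validation loss of a model trained with exp(log_lam):
    a noisy quadratic in log-lambda with its optimum at log_lam = -2."""
    return (log_lam + 2.0) ** 2 + 0.01 * np.random.rand()

best = fmin(
    fn=objective,
    space=hp.uniform("log_lam", -8, 2),   # search range for log(lambda)
    algo=tpe.suggest,
    max_evals=50,
)
print("best log(lambda):", best["log_lam"])
```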

The parameters specific to the problems and the respective data sets are provided below.

D.1 Regression using Cookie Data

D.1.1 Experiment Parameters

Stochastic Hypernetwork Optimization (SHO): We keep the following parameters fixed (to default values), including the noise variance. Changing these values in our experiments did not result in significant improvements. We report the performance of SHO for varying values of the step size. The code is publicly available at: https://github.com/lorraine2/hypernet-hypertraining

Moreau Yosida Regularized HPO (MY-HPO): We keep the following parameter fixed and report the results for varying values. It is well known in the ADMM literature that the selection of the penalty parameter greatly impacts the convergence behavior; such analyses will be explored in a longer version of the paper.

Further, we fix MaxIters in Algorithm 3 and the step size used for training the model in the black-box approaches.

D.1.2 Additional Results

The complete set of results with all the parameter settings is provided in Table 4. As seen from the results, increasing the budget (number of gradient computations) to almost that of the bi-level counterparts allows the black-box approaches to achieve similar performance.

Method | Train Loss () | Validation Loss () | Test Loss ()
()
SHO ()
()
SHO ()
()
SHO ()
()
MyHPO (C) ()
()
MyHPO (C) ()
()
MyHPO (BT) ()
()
MyHPO (BT) ()
()
MyHPO (BT) ()
()
Random ()
()
    ()
()
Grid   ()
()
    ()
()
HyperOpt ()
()
     ()
()
Spearmint ()
()
     ()
()
Table 4: Comparison between different HPO algorithms for regression problem using Cookie data.
Figure 2: Convergence behavior of SHO vs. MY-HPO for different step-sizes for Cookie data using Least Square loss. For MY-HPO we show the values.

Figure 2 illustrates the convergence curves of the algorithms and the unstable behavior of SHO as the step size increases. Note that for larger step sizes the SHO algorithm fails to converge; as the step size gets smaller the algorithm converges, but exhibits a very slow convergence rate. This is also seen in the other experiments in this paper. On the contrary, MY-HPO can accommodate higher step sizes, which results in better convergence. Additionally, Fig. 2 also provides a comparison between the MY-HPO Algorithm 3 and the simplified Algorithm 1. As seen in the figure (and confirmed in Table 3), the simplified algorithm has a lower per-step iteration cost but requires many more iterations overall to converge.

D.2 Regression using MNIST Data

D.2.1 Experiment Parameters

Stochastic Hypernetwork Optimization (SHO): We fix the parameters to be the same as in D.1.1. We vary the parameter as shown in Table 5.

Moreau Yosida Regularized HPO (MY-HPO): We fix the parameters to be the same as in D.1.1. We vary the parameters as shown in Table 5.

Further, for this data we fix MaxIters in Algorithm 3 and the step size used for training the model in the black-box approaches.

D.2.2 Additional Results

Method | Train Loss () | Validation Loss () | Test Loss ()
()
SHO ()
()
SHO ()
()
SHO ()
()
MyHPO (C) ()
()
MyHPO (C) ()
()
MyHPO (BT) ()
()
MyHPO (BT) ()
()
MyHPO (BT) ()
()
Random ()
()
Grid   ()
()
HyperOpt ()
()
Spearmint ()
()
Table 5: Comparison between different HPO algorithms for regression problem using MNIST data.
Figure 3: Convergence behavior of SHO vs. MY-HPO for different step-sizes for MNIST data using Least Square loss. For MY-HPO we show the values.

Note that here, further increasing the computation budget does not provide additional improvement for the black-box approaches. The results lead to similar conclusions:

  1. MY-HPO accommodates higher step sizes, which results in better convergence.

  2. Backtracking further ensures a descent direction (in each iteration) and leads to better convergence.

  3. Increasing the budget (number of gradient computations) to almost that of the bi-level counterparts allows the black-box approaches to achieve similar performance.

D.3 Classification using MNIST Data

D.3.1 Experiment Parameters

Stochastic Hypernetwork Optimization (SHO): We fix the parameters to be the same as in D.1.1. We vary the parameter as shown in Table 6.

Moreau Yosida Regularized HPO (MY-HPO): We fix the parameters to be the same as in D.1.1. We vary the parameters as shown in Table 6.

Further, for this data we fix MaxIters in Algorithm 3 and the step size used for training the model in the black-box approaches.

D.3.2 Additional Results

Method | Train Loss () | Validation Loss () | Test Loss ()
()
SHO ()
()
SHO ()
()
SHO ()
()
MyHPO (C) ()
()
MyHPO (C) ()