Multi-Task Multicriteria Hyperparameter Optimization

02/15/2020 ∙ by Kirill Akhmetzyanov, et al.

We present a new method for searching for optimal hyperparameters across several tasks and several criteria. The Multi-Task Multi-Criteria (MTMC) method provides several Pareto-optimal solutions, among which one solution is selected according to given significance coefficients of the criteria. The article begins with a mathematical formulation of the problem of choosing optimal hyperparameters. Then, the steps of the MTMC method that solves this problem are described. The proposed method is evaluated on an image classification problem using a convolutional neural network. The article presents optimal hyperparameters for various criteria significance coefficients.




1 Introduction

Hyperparameter optimization [9] is an important component in the implementation of machine learning models (for example, logistic regression, neural networks, SVM, gradient boosting, etc.) for solving various tasks, such as classification, regression, ranking, etc. The problem is how to choose the optimal hyperparameters when a trained model is evaluated on several test sets and according to several criteria.

This article describes an approach to solving the problem described above. We will present the results of experiments on the selection of hyperparameters obtained using the proposed approach (MTMC) with various criteria significance coefficients.

The article is organized as follows. First, we discuss related work in Section 2. Section 3 describes the proposed method. Section 4 presents the results of experiments on the selection of optimal hyperparameters. Section 5 contains the conclusion and future work.

2 Related Work

The problem of choosing optimal hyperparameters has long been known. Existing methods for solving this problem yield either a single optimal solution or several.

In [12], a Pareto optimization method is proposed, in which the optimal solution is given for several problems simultaneously. This method consists in minimizing the weighted sum of the loss functions of the individual tasks.

[7] describes a Pareto optimization method based on gradient descent that gives an optimal solution according to several criteria; the optimization is carried out during the learning process. In [10], the search for a Pareto-optimal solution is carried out according to several criteria. The method in [3] gives optimal hyperparameters using backpropagation through the Cholesky decomposition. In [5], optimization is performed using a random choice of hyperparameters based on the expected improvement criterion. [4] proposes a method of hyperparameter optimization based on random search in the space of hyperparameters. In [14], the search for optimal hyperparameters is carried out using Bayesian optimization.

The novelty of the MTMC method is as follows:

1. Optimization is carried out simultaneously according to several criteria and several tasks with setting the significance of the criteria.

2. The choice of optimal hyperparameters is provided after training and evaluation, which eliminates the need to re-train the model.

3. The proposed method does not need to be trained.

3 Optimization of Hyperparameters According to Several Tasks and Several Criteria

First, we describe the mathematical problem that MTMC solves, then we present the steps performed in MTMC.

3.1 Formalization of the problem

In the proposed method, the model is evaluated on several test sets (tasks) . The problem of finding a minimum for tasks is known as minimizing the expected value of empirical risk [16].

The choice of optimal hyperparameters is formalized as follows:


where – the set of all hyperparameters, – the selected optimal hyperparameters, – the vector of significance coefficients of the criteria, – the estimation function of the model with the given hyperparameters and the coefficients , – the task for which optimization is performed.

The developed method gives a solution to the problem (1).

3.2 Description of MTMC

According to (1), the developed method should fulfill the following requirements:

1) the method should solve the minimization problem;

2) the significance of each criterion is determined by the vector of coefficients (the higher the coefficient, the more important the corresponding criterion).

We denote the test sample of the task :


where – the test set has the distribution , – the number of tasks.

Before choosing hyperparameters, for model we obtain an evaluation matrix for the test set and the given evaluation criteria:


where – the model function that transforms the given set and with the given hyperparameters into the evaluation matrix , – the number of criteria, – the dimension of the test set, – the number of hyperparameters, – the number of hyperparameter combinations.

Then, the function is calculated for each set , which is formally described as follows:


The MTMC method yields Pareto optimal solutions through the following steps:

1. The vectors from the evaluation (the number of such vectors is ) are placed in the space of the given criteria.

2. Then we obtain the Pareto optimal solutions, i.e., the Pareto front nearest to the origin of the criteria space:


where – the number of Pareto optimal solutions.

3. The optimal solutions are scaled according to each criterion to the interval :


where – the vector of maximum values of for each criterion, – the vector of minimum values of for each criterion.

Thus, the optimal solution is the solution closest to the origin, and if any solution is the origin, then it is optimal for any .

4. The vector in the space of criteria is defined.

We introduce the vector of the optimal solution, which is the middle of the segment in the axes of the criteria space:


Conditions for are:


5. Project the vectors from the matrix onto the vector :


From (9) and (11) it follows that if the vectors and are collinear, then:


That is, in the case of equality of all elements of , the minimization problem reduces to finding the minimum -norm .

From (11) it also follows that if some component of the vector is equal to zero, then the corresponding criterion will not affect the choice of the optimal hyperparameters. If all components except one are equal to zero, then only the criterion with the nonzero component will affect the choice of optimal hyperparameters.

6. We find hyperparameters at which the minimum of the vector is reached:

Figure 1: Example of a solution given by MTMC, green points denote Pareto optimal solutions, blue vector is the vector , yellow point denotes the optimal solution given by MTMC for a given .

Fig. 1 shows an example solution using MTMC for random numbers in the three-dimensional space.
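The steps above can be illustrated with a minimal Python sketch on random three-dimensional points, in the spirit of Fig. 1. It is not the authors' implementation: the function names (pareto_front_mask, min_max_scale, select_by_projection) are hypothetical, all criteria are assumed to be minimized, scaling is assumed to be min-max scaling to [0, 1], and step 5 is read as a scalar projection of each scaled solution onto the vector of significance coefficients.

```python
import numpy as np


def pareto_front_mask(points: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows, assuming every criterion is minimized."""
    mask = np.ones(points.shape[0], dtype=bool)
    for i in range(points.shape[0]):
        # A row dominates row i if it is no worse in every criterion
        # and strictly better in at least one.
        no_worse = np.all(points <= points[i], axis=1)
        better = np.any(points < points[i], axis=1)
        if np.any(no_worse & better):
            mask[i] = False
    return mask


def min_max_scale(points: np.ndarray) -> np.ndarray:
    """Scale each criterion (column) to [0, 1]; constant columns map to 0."""
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (points - lo) / span


def select_by_projection(points: np.ndarray, weights: np.ndarray) -> int:
    """Index of the row with the smallest scalar projection onto `weights`.

    Assumption: this scalar projection stands in for the projection step
    of the method; the exact formula of the paper is not reproduced here.
    """
    norm = np.linalg.norm(weights)
    if norm == 0.0:
        raise ValueError("at least one significance coefficient must be nonzero")
    scores = points @ (weights / norm)
    return int(np.argmin(scores))


# Random evaluations: 200 hypothetical hyperparameter combinations, 3 criteria.
rng = np.random.default_rng(0)
evaluations = rng.random((200, 3))

front_idx = np.flatnonzero(pareto_front_mask(evaluations))
scaled = min_max_scale(evaluations[front_idx])

# A zero coefficient switches the third criterion off.
weights = np.array([0.5, 0.5, 0.0])
best = front_idx[select_by_projection(scaled, weights)]
print("selected combination:", best, "criteria:", evaluations[best])
```

Under these assumptions, a zero coefficient removes the corresponding criterion from the score, and equal coefficients weight all scaled criteria equally, which matches the qualitative behaviour described after step 5.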

4 Conducting experiments

First, the evaluation matrix for the selected model is obtained. Then, for various combinations of components of , optimal hyperparameters are selected using MTMC.

4.1 Obtaining the evaluation matrix

The developed MTMC method is applied to solve the problem of image classification. The problem we are solving is described in the article [2].

In [1], we selected the MobileNet neural network architecture [8] as a mathematical model for image processing.

The search for optimal hyperparameters was carried out for two popular training methods: changing the learning rate based on the epoch

(where is the initial learning rate, is the coefficient of decreasing learning rate, epoch is the number of epochs) and cyclical learning [13]. In cyclic learning, there are three ways to change the learning rate:

1. triangular – fixed initial learning rate (base_lr), maximum fixed learning rate (max_lr), learning rate increases from base_lr to max_lr and decreases from max_lr to base_lr linearly.

2. triangular2 – fixed initial learning rate (base_lr), maximum learning rate (max_lr), learning rate, as in triangular, varies linearly, but max_lr in the learning process is halved.

3. exp_range – fixed initial learning rate (base_lr), maximum learning rate (max_lr), learning rate also changes linearly, but max_lr in the learning process decreases exponentially.

In the first learning method, the hyperparameters are the initial learning rate (base_lr) and the coefficient of decreasing learning rate (lr_decay). In the second method, the hyperparameters are the way the learning rate changes (cyclic_mode), the initial learning rate (base_lr), and the maximum learning rate (max_lr).
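Both schedules can be sketched in a few lines of Python. The cyclical part follows the formulas in [13]. The epoch-based part assumes that the learning rate is multiplied by lr_decay once per epoch, which is one common reading of such a schedule and may differ from the exact formula used in the experiments. The parameters step_size (iterations per half cycle) and gamma are not given in this article, so their values here are purely illustrative.

```python
import math


def decayed_lr(base_lr: float, lr_decay: float, epoch: int) -> float:
    """Assumed epoch-based schedule: base_lr * lr_decay ** epoch."""
    return base_lr * lr_decay ** epoch


def cyclic_lr(iteration: int, base_lr: float, max_lr: float,
              step_size: int = 100, cyclic_mode: str = "triangular",
              gamma: float = 0.9999) -> float:
    """Cyclical learning rate in the style of [13].

    step_size and gamma are illustrative assumptions.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    if cyclic_mode == "triangular":
        scale = 1.0                      # constant amplitude
    elif cyclic_mode == "triangular2":
        scale = 1.0 / 2 ** (cycle - 1)   # amplitude halved every cycle
    elif cyclic_mode == "exp_range":
        scale = gamma ** iteration       # amplitude decays exponentially
    else:
        raise ValueError(f"unknown cyclic_mode: {cyclic_mode}")
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale


for it in (0, 100, 200, 300):
    print(it, cyclic_lr(it, base_lr=0.001, max_lr=0.005, cyclic_mode="triangular2"))
```

Note that in this formulation it is the amplitude max_lr - base_lr that shrinks in the triangular2 and exp_range modes, while base_lr stays fixed.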

For each hyperparameter, a range and a constant step within that range were selected. Grid search was used over the combinations of hyperparameters.
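As a rough illustration, such a grid can be enumerated with itertools.product. The value lists below are assumptions: they are loosely based on the values that appear in the appendix tables and are not the exact ranges and steps used in the experiments.

```python
from itertools import product

# Illustrative value lists (assumed, not the exact experimental grid).
base_lrs = [0.0001, 0.0005, 0.001, 0.005, 0.01]
lr_decays = [0.75, 0.8, 0.85, 0.9, 0.95]
max_lrs = [0.0001, 0.0005, 0.001, 0.005, 0.01]
cyclic_modes = ["triangular", "triangular2", "exp_range"]

# First learning method: epoch-based decay.
decay_grid = [
    {"base_lr": b, "lr_decay": d}
    for b, d in product(base_lrs, lr_decays)
]

# Second learning method: cyclical learning rate.
cyclic_grid = [
    {"cyclic_mode": m, "base_lr": b, "max_lr": mx}
    for m, b, mx in product(cyclic_modes, base_lrs, max_lrs)
]

print(len(decay_grid), "decay combinations,", len(cyclic_grid), "cyclic combinations")
```

Each dictionary describes one training run configuration; every combination is then trained and evaluated independently.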

For each combination of hyperparameters, training was carried out using k-fold cross-validation [15] with 10 folds. The Keras framework [11] was used for training. The training lasted 15 epochs; the test was carried out on different test sets. That is, is number of different neural networks, neural networks evaluations are conducted, evaluation results are obtained. The neural networks were trained on ten TPUs [6], which took several days.

Among all epochs, for each fold and for each test set, the maximum accuracy is selected, as well as the number of the epoch at which the maximum accuracy is achieved. The following values are calculated for each test set among the folds: the expected value and the variance of the classification error, the expected value and the variance of the epoch number at which convergence on the test set is achieved. We have obtained an evaluation matrix among all neural networks with their hyperparameters and among all test samples.
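The per-test-set computation described above can be sketched as follows. The sketch assumes that the classification error equals one minus the accuracy, that epochs are numbered from 1, and that the unbiased sample variance is used; the function name criteria_for_test_set is hypothetical.

```python
import numpy as np


def criteria_for_test_set(accuracy: np.ndarray) -> np.ndarray:
    """Four criteria for one hyperparameter combination on one test set.

    accuracy has shape (n_folds, n_epochs): test accuracy after each epoch
    for each cross-validation fold.
    """
    best_accuracy = accuracy.max(axis=1)         # maximum accuracy per fold
    best_epoch = accuracy.argmax(axis=1) + 1     # 1-based epoch of that maximum
    error = 1.0 - best_accuracy                  # assumed definition of error
    return np.array([
        error.mean(),
        error.var(ddof=1),
        best_epoch.mean(),
        best_epoch.var(ddof=1),
    ])


# Toy example: 2 folds, 3 epochs.
toy = np.array([[0.50, 0.90, 0.80],
                [0.60, 0.70, 0.95]])
print(criteria_for_test_set(toy))
```

Stacking these four-element vectors over all hyperparameter combinations gives one evaluation matrix per test set.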

4.2 Processing the evaluation matrix

Based on (1), the expected value of each criterion over all the test sets is considered. That is, for all test sets, the criteria are: (i) the sample mean of the classification error, (ii) the sample variance of the classification error, (iii) the sample mean, and (iv) the sample variance of the epoch number at which convergence is achieved on the test set. These values are the criteria for evaluating hyperparameters for a certain test set (matrix from (3)) with the number of criteria .
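Taking the expected value over the test sets amounts to averaging the per-test-set evaluation matrices. A minimal sketch, assuming the matrices are stacked into a single array with the hypothetical layout (test sets, hyperparameter combinations, criteria):

```python
import numpy as np


def aggregate_over_test_sets(per_task: np.ndarray) -> np.ndarray:
    """Average criteria over test sets.

    per_task has shape (n_tasks, n_combinations, n_criteria);
    the result has shape (n_combinations, n_criteria).
    """
    return per_task.mean(axis=0)


# Toy example: 2 test sets, 2 hyperparameter combinations, 2 criteria.
per_task = np.array([[[1.0, 2.0], [3.0, 4.0]],
                     [[3.0, 4.0], [5.0, 6.0]]])
print(aggregate_over_test_sets(per_task))
```

Each row of the result is one point in the criteria space, which is the input to the Pareto front search.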

is calculated from (7), the number of Pareto optimal solutions obtained is . Optimal hyperparameters, i.e., , are presented in Appendix A.

The vector of the optimal solution according to (9) for is . Next, calculations are carried out according to (8) and (11), and for various optimal solutions are chosen according to (5). These optimal solutions are presented in Appendix B. Also, Appendix C shows the graphs of changes in the accuracy of the neural network for each test set and each epoch for the obtained optimal hyperparameters.

5 Conclusion

In this work, we proposed a new method for hyperparameter optimization among several tasks and several criteria. We trained several neural networks with various hyperparameters to solve the image classification problem. Then, for these neural networks, evaluation matrices were obtained on several tasks. We applied MTMC to these matrices and got optimal solutions with different significance coefficients. In the future, we will work to create a meta-learning method that solves the same problem as the method described in this article, but optimization will be performed among various models.

Acknowledgments: The reported study was partially supported by the Government of Perm Krai, research project No. C-26/174.6.

References


  • [1] K. Akhmetzyanov and A. Yuzhakov (2018) Convolutional neural networks comparison for waste sorting tasks. Izvestiya SPbGETU ”LETI” (6), pp. 27. Cited by: §4.1.
  • [2] K. Akhmetzyanov and A. Yuzhakov (2019) Waste sorting neural network architecture optimization. In 2019 International Russian Automation Conference (RusAutoCon), pp. 1–5. Cited by: §4.1.
  • [3] Y. Bengio (2000) Gradient-based optimization of hyperparameters. Neural computation 12 (8), pp. 1889–1900. Cited by: §2.
  • [4] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §2.
  • [5] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pp. 2546–2554. Cited by: §2.
  • [6] (accessed 19.01.2020) Cloud TPU. Note: Cited by: §4.1.
  • [7] J. Fliege and B. F. Svaiter (2000) Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research 51 (3), pp. 479–494. Cited by: §2.
  • [8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §4.1.
  • [9] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle (2009) ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research 36, pp. 267–306. Cited by: §1.
  • [10] C. Igel (2005) Multi-objective model selection for support vector machines. In International Conference on Evolutionary Multi-Criterion Optimization, pp. 534–546. Cited by: §2.
  • [11] (accessed 19.01.2020) Keras: The Python Deep Learning library. Note: Cited by: §4.1.
  • [12] O. Sener and V. Koltun (2018) Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 527–538. Cited by: §2.
  • [13] L. N. Smith (2017) Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. Cited by: §4.1.
  • [14] J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §2.
  • [15] M. Stone (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological) 36 (2), pp. 111–133. Cited by: §4.1.
  • [16] V. Vapnik (1992) Principles of risk minimization for learning theory. In Advances in neural information processing systems, pp. 831–838. Cited by: §3.1.

Appendix A Pareto Optimal Hyperparameters

Table 1 and Table 2 show the Pareto optimal hyperparameters for the two learning methods.

base_lr lr_decay
0.001 0.75
0.001 0.8
0.005 0.75
0.01 0.9
0.01 0.95
Table 1: Pareto optimal solutions for the first learning method.
base_lr max_lr cyclic_mode
0.0001 0.005 exp_range
0.0001 0.005 triangular2
0.0005 0.001 exp_range
0.0005 0.005 triangular2
0.001 0.0001 triangular2
0.001 0.0005 triangular
0.001 0.001 exp_range
0.001 0.005 triangular2
0.005 0.0001 triangular
0.005 0.005 triangular
0.01 0.0001 triangular
0.01 0.0001 triangular2
0.01 0.005 triangular
0.01 0.005 triangular2
0.01 0.01 triangular
0.0001 0.0001 triangular
0.0005 0.001 triangular
0.0005 0.01 triangular2
0.001 0.005 triangular
0.005 0.01 triangular
Table 2: Pareto optimal solutions for the second learning method.

Appendix B MTMC Optimal Hyperparameters

Table 3 shows the optimal hyperparameters obtained using MTMC method for given significance coefficients of the criteria .

0.5 0.5 0.5 0.5 base_lr=0.01,
0.0 0.5 0.5 0.5 base_lr=0.01,
1.0 0.5 0.5 0.5 base_lr=0.01,
0.5 0.0 0.5 0.5 base_lr=0.005,
0.5 1.0 0.5 0.5 base_lr=0.01,
0.5 0.5 0.0 0.5 base_lr=0.01,
0.5 0.5 1.0 0.5 base_lr=0.01,
0.5 0.5 0.5 0.0 base_lr=0.0001,
0.5 0.5 0.5 1.0 base_lr=0.01,
0.0 0.0 0.5 0.5 max_lr=0.005,
1.0 1.0 0.5 0.5 base_lr=0.01,
0.5 0.5 0.0 0.0 base_lr=0.0001,
0.5 0.5 1.0 1.0 base_lr=0.01,
1.0 0.0 0.0 0.0 base_lr=0.0001,
0.0 1.0 0.0 0.0 base_lr=0.0001,
0.0 0.0 1.0 0.0 max_lr=0.005,
0.0 0.0 0.0 1.0 base_lr=0.0005,
Table 3: Optimal hyperparameters for relevant criteria significance coefficients.

Appendix C Accuracy of MTMC Optimal Solutions

The figures below show the optimal solutions chosen by the MTMC method: 95% confidence intervals of the dependence of accuracy on the fold / epoch and maximum accuracy for each of the folds (left) and a “box plot” with maximum accuracy for all test sets (right).

Figure 2: max_lr=0.005, lr_decay=0.75
Figure 3: base_lr=0.0001, max_lr=0.005, cyclic_mode=triangular2
Figure 4: base_lr=0.0005, max_lr=0.001, cyclic_mode=exp_range
Figure 5: base_lr=0.005, max_lr=0.0001, cyclic_mode=triangular
Figure 6: base_lr=0.01, max_lr=0.01, cyclic_mode=triangular