Also, the feature selection problem is naturally multi-objective, since the selected subset of features has to be non-redundant and, simultaneously, yield an accurate and stable trained model [xue2012particle, katrutsa2015stress]. The use of a multi-objective optimization approach in reinforcement learning problems is discussed in [liu2014multiobjective, van2014multi].
Consider $m$ loss functions $L_1(\theta), \ldots, L_m(\theta)$ in a multi-objective optimization problem. These losses depend on the same vector $\theta$. We consider the case of differentiable loss functions $L_i$. Since the considered losses can be conflicting, i.e., a minimizer of one loss is not necessarily a minimizer of another loss, the solution of the multi-objective minimization problem is defined in the following way.
Definition 1 (Pareto-optimal solution)
A point $\theta_1$ dominates a point $\theta_2$ for a multi-objective optimization problem if $L_i(\theta_1) \leq L_i(\theta_2)$ for all $i = 1, \ldots, m$ and at least one inequality is strict.
A point $\theta^*$ is called a Pareto-optimal solution if there is no other point that dominates it.
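Definition 1 can be illustrated on a finite set of candidate points. The following minimal sketch assumes that every candidate is represented by its vector of loss values; the helper names `dominates` and `pareto_optimal_mask` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def dominates(losses_a, losses_b):
    """True if a is no worse than b in every loss and strictly better in at least one."""
    a = np.asarray(losses_a, dtype=float)
    b = np.asarray(losses_b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_optimal_mask(loss_table):
    """Mark the rows of loss_table that are not dominated by any other row."""
    pts = np.asarray(loss_table, dtype=float)
    n = len(pts)
    return np.array([
        not any(dominates(pts[j], pts[i]) for j in range(n) if j != i)
        for i in range(n)
    ])

# Four candidates evaluated on two losses; the third one is dominated by the second.
candidates = [[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0]]
mask = pareto_optimal_mask(candidates)
```

The check is quadratic in the number of candidates, which is acceptable for small illustrative sets only.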
One of the standard approaches to finding a Pareto-optimal solution is scalarization [ghane2015new], e.g., the weighted sum method, which minimizes a conical combination of the losses:
$$\min_{\theta} \sum_{i=1}^{m} w_i L_i(\theta), \tag{1}$$
where $w_i \geq 0$ are static or dynamically updated weights of the losses. A survey of the scalarization approach to solving multi-objective optimization problems is given in [miettinen2002scalarizing]. Note that the weighted sum method (1) gives different Pareto-optimal points for different weights $w_i$. The drawbacks of this approach to identifying Pareto-optimal points are discussed in [das1997closer].
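The dependence of the scalarized solution on the weights can be seen on a toy problem. The sketch below assumes two one-dimensional quadratic losses $\frac{1}{2}(\theta - a)^2$ and $\frac{1}{2}(\theta - b)^2$, which are not taken from the experiments, and minimizes their weighted sum by plain gradient descent; the function name `weighted_sum_minimize` is hypothetical.

```python
import numpy as np

def weighted_sum_minimize(grad_fns, weights, theta0, lr=0.1, n_iter=500):
    """Gradient descent on the weighted sum of losses given by their gradient functions."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iter):
        grad = sum(w * g(theta) for w, g in zip(weights, grad_fns))
        theta = theta - lr * grad
    return theta

# Toy losses (assumption): 0.5 * (theta - a)^2 and 0.5 * (theta - b)^2.
a, b = np.array([0.0]), np.array([1.0])
grad_fns = [lambda t: t - a, lambda t: t - b]

theta_equal = weighted_sum_minimize(grad_fns, (1.0, 1.0), [5.0])   # minimizer (a + b) / 2
theta_skewed = weighted_sum_minimize(grad_fns, (3.0, 1.0), [5.0])  # minimizer (3a + b) / 4
```

Both resulting points lie between $a$ and $b$ and are therefore Pareto-optimal for this toy problem, but they differ, so the weights effectively select the point on the Pareto front.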
Multiple gradient descent algorithm
Another approach to finding a Pareto-optimal point is the multiple-gradient descent algorithm (MGDA) [desideri2012multiple]. The main idea of this algorithm is to find, in every iteration, a common descent direction for all considered losses. The procedure for finding this direction is based on a necessary condition for a point to be a Pareto-optimal solution. This condition is called Pareto stationarity.
Definition 2 (Pareto stationarity)
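Given differentiable losses $L_i(\theta)$, $i = 1, \ldots, m$, a point $\theta$ is called Pareto-stationary if and only if there exists a convex combination of the gradients $\nabla L_i(\theta)$ that equals zero, i.e.,
$$\sum_{i=1}^{m} \alpha_i \nabla L_i(\theta) = 0, \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \geq 0. \tag{2}$$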
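Based on this necessary condition, the descent direction in MGDA is the convex combination of gradients whose norm is minimal. Thus, the following optimization problem has to be solved:
$$\alpha^* = \arg\min_{\alpha} \Big\| \sum_{i=1}^{m} \alpha_i g_i \Big\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \geq 0, \tag{3}$$
where $g_i = \nabla L_i(\theta)$. Denote by $d^*$ the resulting convex combination of $g_i$, i.e., $d^* = \sum_{i=1}^{m} \alpha_i^* g_i$. If $d^* = 0$, then $\theta$ is a Pareto-stationary point. Otherwise, $-d^*$ is a descent direction for all losses [desideri2012multiple]. Therefore, MGDA updates the vector $\theta_k$, where $k$ is the iteration number, similarly to gradient descent for a single-objective optimization problem:
$$\theta_{k+1} = \theta_k - \eta d^*, \tag{4}$$
where $\eta$ is the learning rate. Figure 1 illustrates the position of the direction $d^*$ in the case of two gradients $g_1$ and $g_2$.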
This choice of the descent direction leads to convergence to the Pareto-stationary point that is closest to the initial point among all Pareto-stationary points. Thus, in contrast to the weighted sum method (1), MGDA converges to a particular Pareto-stationary point.
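For two losses, the minimum-norm convex combination of the gradients admits a simple closed form, which makes the MGDA iteration easy to sketch. The code below is an illustration under stated assumptions and not the reference implementation: it treats only the two-gradient case, the toy losses $\frac{1}{2}\|\theta - a\|_2^2$ and $\frac{1}{2}\|\theta - b\|_2^2$ are chosen for this example, and the names `mgda_direction_two` and `mgda_descent` are hypothetical.

```python
import numpy as np

def mgda_direction_two(g1, g2):
    """Min-norm convex combination gamma * g1 + (1 - gamma) * g2 of two gradients."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:                       # identical gradients: any gamma works
        return g1.copy(), 0.5
    # Unconstrained minimizer of ||g2 + gamma * (g1 - g2)||^2, clipped to [0, 1].
    gamma = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return gamma * g1 + (1.0 - gamma) * g2, gamma

def mgda_descent(grad_fns, theta0, lr=0.1, n_iter=200):
    """Repeat the update theta <- theta - lr * d with the min-norm direction d."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iter):
        d, _ = mgda_direction_two(grad_fns[0](theta), grad_fns[1](theta))
        theta = theta - lr * d
    return theta

# Toy losses (assumption): 0.5 * ||theta - a||^2 and 0.5 * ||theta - b||^2.
a, b = np.array([0.0, 0.0]), np.array([1.0, 0.0])
grad_fns = [lambda t: t - a, lambda t: t - b]
theta_final = mgda_descent(grad_fns, [2.0, 3.0])

# Direction for two orthogonal unit gradients: it lies strictly inside the segment.
d_inner, gamma_inner = mgda_direction_two([1.0, 0.0], [0.0, 1.0])
```

In this toy problem the Pareto-stationary points form the segment between $a$ and $b$, and the iterates approach a point of this segment. For the orthogonal pair, the inner products of the direction with both gradients coincide with its squared norm.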
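One more property of the direction $d^*$ is the following (see Theorem 2.2 in [desideri2012multiple]): if this direction belongs to the interior of the convex hull of the gradients $g_i$, it satisfies
$$(d^*, g_i) = \|d^*\|_2^2, \quad i = 1, \ldots, m.$$
It means that each loss gets approximately the same absolute decrease after updating $\theta_k$ according to (4), since
$$L_i(\theta_k) - L_i(\theta_{k+1}) \approx \eta \, (g_i, d_k^*) = \eta \, \|d_k^*\|_2^2,$$
where $d_k^*$ is the direction in the $k$-th iteration.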
This is the starting point for our method: if the individual losses are not balanced (e.g., one loss is multiplied by a large factor), the Pareto front does not change, but the solution of (3) changes significantly; compare Figures 0(a) and 0(b). Instead, we propose to look for the direction that gives the same relative decrease:
or, in other words, the direction has the same angle with all the gradients (see Theorem 1). Therefore, the proposed method is called the Equiangular Direction Method (EDM). For two gradients, this direction is the bisector of the angle between them; see Figure 1.
2 Equiangular Direction Method (EDM)
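The key step of the EDM is to find the direction that has the same angle with the individual gradients $g_i$. This direction can be naturally written as a convex combination of the normalized gradients:
$$d = \sum_{i=1}^{m} \alpha_i^* \frac{g_i}{\|g_i\|_2}, \tag{7}$$
where the coefficients $\alpha_i^*$ are the solution of the following convex optimization problem:
$$\alpha^* = \arg\min_{\alpha} \Big\| \sum_{i=1}^{m} \alpha_i \frac{g_i}{\|g_i\|_2} \Big\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \geq 0. \tag{8}$$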
The following theorem shows the main characteristic property of this direction.
Theorem 1
The direction $d$ defined in (7) satisfies the following equality
$$\Big( d, \frac{g_i}{\|g_i\|_2} \Big) = \|d\|_2^2$$
for all $i$ such that $\alpha_i^* > 0$. If $\alpha_i^* > 0$ for all $i$, the direction $d$ has the same angle with all individual gradients, i.e., this direction and the gradients are equiangular with some angle $\beta$.
Let be the index set such that is equivalent to . After that, the problem (8) can be re-written as
The solution uniquely defines in the following way:
Now we can write the Lagrangian and derive the KKT optimality conditions for this optimization problem. Note that the constraint is used in the KKT conditions implicitly, since by construction there exists a solution that satisfies this constraint. Therefore, the Lagrangian is the following function . It follows from the KKT optimality conditions that the gradient of with respect to has to be zero in and for :
since only indices from the set correspond to nonzero elements of (9). Thus we get the equality for any index such that :
Now we can compute the value of the remaining factor . Consider the following chain of equalities:
Thus, the equality (10) can be re-written in the final form:
where is any index such that .
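The equiangular property and its insensitivity to the scale of the losses can be checked numerically for two gradients. The sketch below relies on an assumption that follows from symmetry: for two unit vectors, the minimum-norm convex combination is their midpoint, so both coefficients equal $1/2$ and the direction is the bisector. The function names `edm_direction_two` and `cosine` are hypothetical, and nonzero gradients are assumed.

```python
import numpy as np

def edm_direction_two(g1, g2):
    """Bisector direction: midpoint of the two normalized gradients."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    u1 = g1 / np.linalg.norm(g1)           # assumes g1 != 0
    u2 = g2 / np.linalg.norm(g2)           # assumes g2 != 0
    return 0.5 * (u1 + u2)

def cosine(a, b):
    """Cosine of the angle between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g1 = np.array([1.0, 0.0])
g2 = np.array([1.0, 2.0])

d = edm_direction_two(g1, g2)
d_scaled = edm_direction_two(g1, 100.0 * g2)   # second loss multiplied by 100

# Inner product with the first normalized gradient versus the squared norm of d.
inner_u1 = float(d @ (g1 / np.linalg.norm(g1)))
sq_norm = float(d @ d)
```

Multiplying one loss by a positive factor rescales its gradient but not the normalized gradient, so the direction stays the same, whereas the minimum-norm combination of the raw gradients in (3) would move towards the smaller gradient.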
2.1 Normalization of the equiangular direction
To define the direction only angles between gradients are important. But to define the proper norm of this direction, we need additional assumptions. If we have only two gradients and and their norms are equal, then it is natural to require that the normalized vector coincides with the vector , i.e., the vector has to belong to the convex hull of gradients and . To satisfy this requirement, the scale factor and the corresponding vector have the following forms:
For there is an explicit formula for the bisector direction:
which coincides with the solution of problem (8) only up to a normalization factor. To obtain the normalized vector that belongs to the convex hull of the gradients, we use the scale factor and obtain the final form of the vector :
Note that if , the direction defined by (12) provides a guaranteed descent direction for both of the losses.
The equiangular direction method either converges to a Pareto-stationary point in a finite number of iterations or generates an infinite sequence , in which case there exists a subsequence that converges to a Pareto-stationary point.
If the EDM converges after iterations to the point such that , then there exists such that . Therefore, is a Pareto-stationary point since , where and . On the other hand, if the sequence is infinite, then the proof is the same as the proof of Theorem 2.3. from [desideri2012multiple] on the convergence of MGDA.
The proposed method is summarized in Algorithm 1.
2.2 Frank-Wolfe method to find
Since the feasible set in problem (8) is a simplex, this convex optimization problem can be efficiently solved by the Frank-Wolfe method [frank1956algorithm, jaggi2013revisiting]. Note that the objective function in problem (8) can be written as
where and is a matrix such that . Then, one iteration of the Frank-Wolfe method for solving problem (8) reduces to the following steps. The first step is computing the gradient of the objective function (13) and finding the index of its smallest element. The second step is computing the optimal step size by solving an auxiliary one-dimensional optimization problem:
where is the -th basis vector. The solution of problem (14) can be written in the closed form:
The third step is updating coefficients of convex combination as
For convenience we provide the detailed description of the Frank-Wolfe method to solve the problem (8) in Algorithm 2. Note that every iteration of the Frank-Wolfe method that solves problem (8) can be interpreted from the geometric perspective. In particular, the angle between the basis vector and gradient is maximum among angles between the gradient and the basis vectors, see line 4 in Algorithm 2.
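The three steps above can be sketched in a few lines. The code below is a hedged illustration and not the authors' Algorithm 2: it assumes that the objective is $\alpha^\top M \alpha$ with $M$ being the Gram matrix of the normalized gradients, that all gradients are nonzero, and that the step size is obtained by exact line search clipped to $[0, 1]$. The function name `edm_coefficients_fw`, the stopping tolerance, and the iteration budget are hypothetical choices.

```python
import numpy as np

def edm_coefficients_fw(grads, n_iter=2000, tol=1e-12):
    """Frank-Wolfe with exact line search for min alpha^T M alpha over the simplex."""
    G = np.asarray(grads, dtype=float)                  # shape (m, n), one gradient per row
    U = G / np.linalg.norm(G, axis=1, keepdims=True)    # normalized gradients (nonzero assumed)
    M = U @ U.T                                         # Gram matrix of normalized gradients
    m = M.shape[0]
    alpha = np.full(m, 1.0 / m)                         # start from the center of the simplex
    for _ in range(n_iter):
        grad = 2.0 * M @ alpha                          # step 1: gradient of the objective
        t = int(np.argmin(grad))                        # index of its smallest element
        step_dir = -alpha.copy()
        step_dir[t] += 1.0                              # e_t - alpha
        gap = -grad @ step_dir                          # Frank-Wolfe gap, nonnegative
        curvature = step_dir @ M @ step_dir
        if gap <= tol or curvature <= tol:
            break
        gamma = min(1.0, gap / (2.0 * curvature))       # step 2: exact line search on [0, 1]
        alpha = alpha + gamma * step_dir                # step 3: (1 - gamma) * alpha + gamma * e_t
    return alpha, U.T @ alpha                           # coefficients and resulting direction

# Three gradients of very different scales; the third is the bisector of the first two.
alpha, d = edm_coefficients_fw([[1.0, 0.0], [0.0, 5.0], [3.0, 3.0]])

# Two orthogonal gradients: the uniform starting point is already optimal.
alpha_two, d_two = edm_coefficients_fw([[1.0, 0.0], [0.0, 100.0]])
```

In the three-gradient example the minimum-norm point of the convex hull of the normalized gradients is $(0.5, 0.5)$, so the weight of the third gradient is expected to vanish as the iterations proceed. Since the Frank-Wolfe method converges sublinearly, the returned coefficients are approximate.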
3 Computational experiment
In this section, we compare the proposed method with other approaches to solving the imbalanced classification problem and the multi-task learning problem. As an example of the latter, we consider a classification problem on the MultiMNIST dataset, which is a modification of the standard MNIST dataset [lecun1998gradient] appropriate for multi-task learning. To compute the gradients of the losses in the considered problems, we use the automatic differentiation implemented in the PyTorch framework [paszke2017automatic]. The source code can be found on GitHub: https://github.com/amkatrutsa/edm.
3.1 Imbalanced classification problem
Let be the dataset with classes, i.e. . Denote by a set of samples . Consider a function that estimates the label of a sample . We need to find a parameter such that the classification quality is as high as possible. To measure the classification quality, a loss function is introduced and minimized with respect to . This loss function is typically written as
where is the loss corresponding to the -th class:
where is a set of indices such that if then and is a loss for a given pair . We use the cross-entropy loss function and represent as a neural network. Therefore, the vector is composed of the stacked vectorized parameters of this neural network.
The classification problem is called imbalanced if there exists a class label such that , for all . In other words, the number of samples from the class is significantly smaller than the number of samples from the other classes. In the case of binary imbalanced classification, where , the class is called minor and the other class is called major. Further, we always refer to the label of the major class as and to the label of the minor class as . If one always assigns to any sample the label , the standard accuracy computed over all samples will be close to 1. However, it means that the samples with the ground-truth label are always misclassified. To address this issue, a class weight is introduced to balance and . Denote by the weight corresponding to the minor class. Then the total loss function can be re-written in the form
The higher the value of , the higher the classification accuracy on the minor class is expected to be. However, the hyperparameter has to be tuned to balance the accuracies on the major and minor classes. To avoid this tuning, EDM can be used to automatically balance and without introducing an additional hyperparameter.
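The effect described above can be reproduced on synthetic labels. The sketch below is an illustration under several assumptions that are not taken from the paper: the major and minor classes are encoded as `0` and `1`, every per-class loss is the mean cross-entropy over the samples of that class, and the total loss has the form "major loss plus weight times minor loss". All names are hypothetical.

```python
import numpy as np

MAJOR, MINOR = 0, 1   # hypothetical label convention for this example

def per_class_losses(probs, labels):
    """Mean cross-entropy computed separately over major and minor samples."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    nll = -np.log(probs[np.arange(len(labels)), labels])
    return float(nll[labels == MAJOR].mean()), float(nll[labels == MINOR].mean())

def weighted_total_loss(probs, labels, minor_weight):
    """Assumed form of the total loss: major loss + minor_weight * minor loss."""
    loss_major, loss_minor = per_class_losses(probs, labels)
    return loss_major + minor_weight * loss_minor

def per_class_accuracy(pred, labels):
    """Accuracy on the major class and on the minor class."""
    pred, labels = np.asarray(pred), np.asarray(labels)
    return tuple(float((pred[labels == c] == c).mean()) for c in (MAJOR, MINOR))

# 99 major samples and a single minor sample.
labels = np.array([MAJOR] * 99 + [MINOR])

# A trivial classifier that always predicts the major class.
always_major = np.full_like(labels, MAJOR)
overall_acc = float((always_major == labels).mean())
acc_major, acc_minor = per_class_accuracy(always_major, labels)

# The same classifier expressed through probabilities: 0.9 for the major class.
probs = np.tile([0.9, 0.1], (len(labels), 1))
loss_w1 = weighted_total_loss(probs, labels, minor_weight=1.0)
loss_w10 = weighted_total_loss(probs, labels, minor_weight=10.0)
```

The overall accuracy is high although every minor sample is misclassified, and a larger weight makes the total loss more sensitive to the errors on the minor class, which is the quantity that otherwise has to be tuned by hand.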
To compare the considered methods in the imbalanced classification problem, we use the credit card transaction dataset [dal2017credit], where the minor class consists of fraudulent transactions. The number of samples in this dataset is , and the number of features is . Note that the number of fraudulent transactions is only , which is of the total number of samples. Thus, we have an imbalanced binary classification problem.
To demonstrate how EDM adjusts accuracies in the imbalanced classification problem, we compare EDM with the vanilla SGD method that minimizes the total loss (17) for and . More advanced gradient-based methods still suffer from the necessity of tuning the hyperparameter . Therefore, we compare EDM with the vanilla SGD method only. We consider a simple neural network with two fully-connected layers, a ReLU activation between them, and a hidden dimension equal to . The entire dataset is split into train and test sets such that the portion of the minor class is the same in both. The numbers of samples in the train and test sets are 227845 and 56962, respectively. Since the classes are imbalanced, we generate batches for them separately. The batch sizes are different, but during every epoch 40 batches of every class are used in the training process. We test different learning rates in the considered methods. Since a smaller learning rate requires a larger number of epochs, we use different numbers of epochs in training. In particular, learning rates and in all considered methods require epochs for convergence. In the case of learning rate , SGD with and converges after epochs, but EDM and MGDA require and epochs for convergence, respectively.
Table 1 presents the test accuracies separately for the minor and major classes. It shows that EDM gives a reasonable trade-off between the accuracies on such imbalanced classes without any hyperparameter tuning. Moreover, EDM is robust to a range of step sizes and preserves balanced accuracies for the individual classes. Also, EDM gives higher accuracy for the major class and slightly smaller or equal accuracy for the minor class compared with MGDA; see Tables 1(a) and 1(b). In the case of learning rate , EDM provides higher accuracies for both the major and minor classes than MGDA.
Table 1: Test accuracies for both classes given by the considered methods. The reported mean values and standard deviations are computed from three random initializations of the considered model.
3.2 Multi-task learning problem
In this section, the standard classification problem is reduced to the multi-task learning (MTL) problem following the study [sener2018multi]. To test the presented method in solving the MTL problem, we consider the MultiMNIST dataset and an adaptation of the LeNet neural network [lecun1989backpropagation]. The MultiMNIST dataset is a modification of the classical MNIST dataset [lecun1998gradient]. Every image from MultiMNIST is composed of two MNIST images: one image is placed in the top-left corner, and the other one is placed in the bottom-right corner [sabour2017dynamic, sener2018multi]. To make the overlay consistent, we create an image of size , place digits from the original MNIST dataset in the opposite corners, and finally scale it to the standard size . Samples from MultiMNIST are shown in Figure 2.
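The construction of a two-digit image can be sketched as follows. This is an illustration under assumptions that are not stated in the text: the canvas size of 36 and the output size of 28 are hypothetical default values, overlapping pixels are merged by a pixelwise maximum, and the rescaling uses nearest-neighbor sampling. The function name `make_multimnist_like` is hypothetical as well.

```python
import numpy as np

def make_multimnist_like(top_left, bottom_right, canvas_size=36, out_size=28):
    """Place two equally sized digit images in opposite corners and rescale the canvas."""
    h, w = top_left.shape
    canvas = np.zeros((canvas_size, canvas_size), dtype=float)
    canvas[:h, :w] = top_left                                  # first digit: top-left corner
    region = canvas[canvas_size - h:, canvas_size - w:]        # view of the bottom-right corner
    np.maximum(region, bottom_right, out=region)               # overlay by pixelwise maximum
    idx = np.arange(out_size) * canvas_size // out_size        # nearest-neighbor sampling grid
    return canvas[np.ix_(idx, idx)]

# Constant "digits" are used only to make the placement easy to inspect.
digit_a = np.ones((28, 28))
digit_b = 2.0 * np.ones((28, 28))
image = make_multimnist_like(digit_a, digit_b)
```

With real data, `digit_a` and `digit_b` would be two MNIST images, and the pair of their labels would serve as the targets of the two tasks.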
Now we have the MTL problem, where the first task is to classify the image in the top-left corner, and the second task is to classify the image in the bottom-right corner. To solve this MTL problem, the following modification of the LeNet architecture [lecun1989backpropagation] is presented in [sener2018multi] and is used in this study; see Figure 3. Shared layers generate a representation of every image that is used to solve both tasks. Task-specific layers are responsible for solving a particular task: they use the representation constructed by the shared layers and do not affect the solution of the other task.
Following the work [sener2018multi], we update parameters in shared and task-specific layers differently. The parameters of the shared layers are updated based on the multi-objective optimization methods since these parameters affect losses corresponding to both tasks simultaneously. The parameters of the task-specific layers are updated with the SGD method since these parameters affect only single-task losses.
In the presented experiments, we illustrate the robustness of the EDM to the multi-scale losses in contrast to the MGDA method. Denote by and cross-entropy losses for the first and the second tasks, respectively. To control the scale of the loss , we introduce a hyper-parameter that is multiplied by the loss . In this setting we minimize the losses and concurrently. The larger , the larger loss . In particular, if , then losses and are minimized as they are. Note that the magnitudes of both losses are approximately the same for the considered dataset and neural network. At the same time, if , then the loss becomes ten times larger. We expect that in this case, EDM ensures more robust training than MGDA and, consequently, higher test accuracy. Also, we compare EDM with the single-task approach [sener2018multi]. This approach is based on two identical task-specific neural networks such that every neural network solves a particular task, i.e., classification of the top-left or the bottom-right image. The architecture of these neural networks coincides with the LeNet modification in Figure 3, but includes only one block of task-specific layers, since every network solves only one task. The single-task neural networks are trained with the vanilla SGD method for a fair comparison with EDM and MGDA.
Table 2 presents the test accuracy obtained by the considered methods for and . In this experiment we use learning rate , batch size and epochs. We show in Table 2(a) that even for EDM is more or equally accurate in both tasks compared with MGDA and the single-task approach. The desired property of a multi-objective optimization method to be robust to different scales of the individual losses is clearer from Table 2(b). This table corresponds to the setting where the loss is multiplied by the factor . The test accuracies in both classes given by EDM are significantly higher than the test accuracies corresponding to MGDA and the single-task approach. Naturally, the single-task approach gives the same test accuracies for the top-left class for both values of , since the corresponding loss is unchanged.
4 Related works
Multi-objective optimization problems come from many applications, where the target vector is evaluated by more than one loss function. Examples of such applications are computer networks [donoso2016multi], energy saving [cui2017multi], engineering applications [chiandussi2012comparison, marler2004survey], etc. In machine learning and deep learning, models are typically trained by minimizing some pre-defined loss function computed on the training dataset. However, the quality of the trained model is additionally evaluated according to external criteria. For example, the coefficient of determination in regression problems estimates the proportion of the variance of the dependent variable that is explained by the trained model [helland1987interpretation]. In classification problems, an AUC score close to one indicates the high quality of the trained model [vanderlooy2008critical]. In deep learning, neural networks for image classification can be evaluated on their robustness against adversarial attacks [yuan2019adversarial, tursynbek2020geometry]. Also, recently proposed neural ODE models can be compared not only based on the primal quality measure but also based on the smoothness of the trained dynamics [gusak2020towards]. Although most of the external criteria used to evaluate the quality of models depend on discrete variables, gradient-free methods for solving multi-objective discontinuous optimization problems have been proposed [deb2001multi]. The most common approaches are genetic algorithms [deb2002fast, von2014survey, abraham2005evolutionary], particle swarm optimization methods [wang2009particle], and other nature-inspired heuristics [coello2009advances, omkar2011artificial]. These methods randomly explore the search space to find Pareto-optimal points and mostly lack any guarantees of convergence to a Pareto-optimal point.
We focus on unconstrained multi-objective optimization problems, where the loss functions are differentiable. Methods for solving such problems can use the gradients of the individual losses. Besides the weighted sum method, which is typically used in applications [liu2016gradient], modifications of the standard methods for single-objective optimization problems have been proposed. For example, the steepest descent method for multi-objective optimization problems is proposed in [fliege2000steepest]. This method requires solving an auxiliary optimization problem in every iteration to get a descent direction, which is similar to MGDA. An extension of this approach to problems with box constraints is presented in [miglierina2008box], which also interprets multi-objective optimization problems from the perspective of dynamical systems theory. Further, the proximal method for solving multi-objective optimization problems is proposed in [bonnel2005proximal], but without any numerical comparison with other methods. One more approach to solving multi-objective problems is the generalized homotopy approach [hillermeier2001generalized]. It represents the Pareto-optimal points as a differentiable manifold and generates new Pareto-optimal points through numerical evaluation of a local chart of this manifold.
This study considers multi-objective optimization problems, where the loss functions are of different scales. To solve problems with this property, we propose the Equiangular Direction Method (EDM) and prove that it guarantees an equal relative decrease of every loss function. Thus, EDM is robust to multi-scale losses. We illustrate the performance of the EDM in solving the imbalanced classification and multi-task learning problems. The proposed method provides the highest test accuracy compared with the other approaches to solving the considered problems.