1 Introduction
Many problems in machine learning and deep learning require minimizing several loss functions simultaneously. Such problems appear, for example, in multitask learning [zhang2017survey], where a single model is trained to perform well on several (possibly conflicting) tasks. A survey of machine learning problem statements with multiple objective functions is given in [jin2006multi]. In particular, an imbalanced classification problem can be solved efficiently by introducing several objective functions [soda2011multi]. Also, the feature selection problem is naturally multiobjective, since the selected subset of features has to be non-redundant and, at the same time, yield an accurate and stable trained model [xue2012particle, katrutsa2015stress]. The use of multiobjective optimization in reinforcement learning is discussed in
[liu2014multiobjective, van2014multi].

Denote by $f_1(x), \dots, f_m(x)$ the loss functions in the multiobjective optimization problem. These losses depend on the same parameter vector $x \in \mathbb{R}^n$. We consider the case of differentiable loss functions $f_i$. Since the considered losses can be conflicting, i.e. a minimizer of one loss is not a minimizer of another loss, the solution of the multiobjective minimization problem is defined in the following way.

Definition 1 (Pareto-optimal solution)

A point $x$ dominates a point $y$ in a multiobjective optimization problem if $f_i(x) \le f_i(y)$ for $i = 1, \dots, m$ and at least one of these inequalities is strict.

A point is called a Pareto-optimal solution if there is no other point that dominates it.
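Definition 1 is easy to check numerically. The sketch below is our illustration, not part of the paper; the function names are ours. It tests dominance between loss vectors and filters a finite set of candidate points down to its non-dominated subset:

```python
import numpy as np

def dominates(fx, fy):
    """x dominates y iff f_i(x) <= f_i(y) for every loss i
    and at least one inequality is strict."""
    fx, fy = np.asarray(fx), np.asarray(fy)
    return bool(np.all(fx <= fy) and np.any(fx < fy))

def non_dominated(points):
    """Indices of Pareto-optimal points among a finite list of loss vectors."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]
```

For example, among the loss vectors $(1,2)$, $(2,1)$, $(2,2)$, $(3,3)$, only the first two are non-dominated.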
One of the standard approaches to finding a Pareto-optimal solution is scalarization [ghane2015new], e.g. the weighting sum method that minimizes a conical combination of the losses:

(1) $\min_{x} \sum_{i=1}^{m} c_i f_i(x), \quad c_i \ge 0,$

where $c_i$ are static or dynamically updated weights of the losses. A survey on the scalarization approach to solving multiobjective optimization problems is given in [miettinen2002scalarizing]. Note that the weighting sum method (1) gives different Pareto-optimal points for different weights $c_i$. The drawbacks of this approach to identifying Pareto-optimal points are discussed in [das1997closer].
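As a toy illustration of (1) (our example, not from the paper), take two conflicting scalar losses $f_1(x) = (x-1)^2$ and $f_2(x) = (x+1)^2$: gradient descent on the weighted sum lands on different Pareto-optimal points of the segment $[-1, 1]$ depending on the weights:

```python
import numpy as np

# Two conflicting losses: f1 is minimized at x = 1, f2 at x = -1.
grads = [lambda x: 2.0 * (x - 1.0),   # f1'(x)
         lambda x: 2.0 * (x + 1.0)]   # f2'(x)

def weighted_sum_descent(c, x0=0.0, lr=0.1, steps=300):
    """Gradient descent on the conical combination c1*f1 + c2*f2 from (1)."""
    x = x0
    for _ in range(steps):
        x -= lr * sum(ci * gi(x) for ci, gi in zip(c, grads))
    return x
```

For weights summing to one, the minimizer of $c_1 f_1 + c_2 f_2$ is $(c_1 - c_2)/(c_1 + c_2)$, so weights $(0.9, 0.1)$ and $(0.1, 0.9)$ recover two different Pareto-optimal points.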
Multiple gradient descent algorithm
Another approach to finding a Pareto-optimal point is the multiple-gradient descent algorithm (MGDA) [desideri2012multiple]. The main idea of this algorithm is to find, in every iteration, a common descent direction for the considered losses $f_i$. The procedure to find this direction is based on a necessary condition for a point to be a Pareto-optimal solution. This condition is called Pareto stationarity.
Definition 2 (Pareto stationarity)
Given differentiable losses $f_1, \dots, f_m$, a point $x$ is called Pareto-stationary iff there exists a convex combination of the gradients $\nabla f_i(x)$ that equals zero, i.e.

(2) $\sum_{i=1}^{m} \alpha_i \nabla f_i(x) = 0, \quad \alpha_i \ge 0, \quad \sum_{i=1}^{m} \alpha_i = 1.$
Based on this necessary condition, the descent direction in MGDA is derived from the convex combination of gradients whose norm is minimal. Thus, the following optimization problem has to be solved:

(3) $\alpha^* = \arg\min_{\alpha \in \Delta_m} \Big\| \sum_{i=1}^{m} \alpha_i \nabla f_i(x) \Big\|_2^2,$

where $\Delta_m = \{ \alpha \in \mathbb{R}^m : \alpha_i \ge 0, \ \sum_{i=1}^{m} \alpha_i = 1 \}$ is the standard simplex. Denote by $d$ the resulting convex combination of the gradients, i.e. $d = \sum_{i=1}^{m} \alpha_i^* \nabla f_i(x)$. If $d = 0$, then $x$ is a Pareto-stationary point. Otherwise, $-d$ is a descent direction for all the losses $f_i$ [desideri2012multiple]. Therefore, MGDA updates the vector $x_k$, where $k$ is the iteration number, similarly to gradient descent for a single-objective optimization problem:

(4) $x_{k+1} = x_k - h_k d_k,$

where $h_k > 0$ is a learning rate. Figure 1 illustrates the position of the direction $d$ in the case of two gradients $\nabla f_1$ and $\nabla f_2$.
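For $m = 2$, problem (3) has a closed-form solution via the clipped one-dimensional min-norm formula (this snippet is our sketch, cf. the closed form used in [sener2018multi]). It also previews the scale sensitivity discussed below: rescaling one gradient changes the direction noticeably:

```python
import numpy as np

def mgda_direction(g1, g2):
    """Minimum-norm point of the segment [g1, g2]: the solution of (3) for m = 2."""
    diff = g2 - g1
    denom = diff @ diff
    if denom == 0.0:
        return g1.astype(float)
    gamma = np.clip((g2 @ diff) / denom, 0.0, 1.0)  # optimal weight of g1
    return gamma * g1 + (1.0 - gamma) * g2

# Balanced gradients: d is the midpoint of the two gradients.
d_balanced = mgda_direction(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Rescale the first loss by 2: the direction tilts toward the smaller gradient.
d_scaled = mgda_direction(np.array([2.0, 0.0]), np.array([0.0, 1.0]))
```

Here `d_balanced` is `[0.5, 0.5]`, while `d_scaled` is `[0.4, 0.8]`: doubling one loss changes the MGDA direction even though the Pareto front is unchanged.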
This choice of the descent direction leads to convergence to the Pareto-stationary point that is the closest to the initial point among all Pareto-stationary points. Thus, in contrast to the weighting sum method (1), MGDA converges to a particular Pareto-stationary point.
One more property of the direction $d$ is the following (see Theorem 2.2 in [desideri2012multiple]): if this direction belongs to the interior of the convex hull of the gradients $\nabla f_i(x)$, it satisfies

(5) $\nabla f_i(x)^\top d = \|d\|_2^2, \quad i = 1, \dots, m.$

It means that each loss gets approximately the same absolute decrease after updating according to (4), since

$f_i(x_k) - f_i(x_{k+1}) \approx h_k \nabla f_i(x_k)^\top d_k = h_k \|d_k\|_2^2,$

where $d_k$ is the direction in the $k$-th iteration.
This is the starting point for our method: if the individual losses are not balanced (e.g. one loss is multiplied by a large factor), the Pareto front does not change, but the solution of (3) changes significantly; compare Figures 0(a) and 0(b). Instead, we propose to look for the direction that gives every loss the same decrease relative to the scale of its gradient:

(6) $\frac{\nabla f_1(x)^\top d}{\|\nabla f_1(x)\|_2} = \dots = \frac{\nabla f_m(x)^\top d}{\|\nabla f_m(x)\|_2},$

or, in other words, the direction $d$ has the same angle with all the gradients (see Theorem 1). Therefore, the proposed method is called the Equiangular Direction Method (EDM). For two gradients $\nabla f_1$ and $\nabla f_2$, the direction $d$ is the bisector of the angle between them, see Figure 1.
2 Equiangular Direction Method (EDM)
The key step of the EDM is to find the direction $d$ that has the same angle with the individual gradients $\nabla f_i(x)$. This direction can be naturally written as a convex combination of the normalized gradients $\hat{g}_i = \nabla f_i(x) / \|\nabla f_i(x)\|_2$:

(7) $d = \sum_{i=1}^{m} \alpha_i^* \hat{g}_i,$

where the coefficients $\alpha^*$ are the solution of the following convex optimization problem:

(8) $\alpha^* = \arg\min_{\alpha \in \Delta_m} \Big\| \sum_{i=1}^{m} \alpha_i \hat{g}_i \Big\|_2^2, \quad \Delta_m = \Big\{ \alpha \in \mathbb{R}^m : \alpha_i \ge 0, \ \sum_{i=1}^{m} \alpha_i = 1 \Big\}.$
The following theorem shows the main characteristic property of this direction.

Theorem 1

The direction $d$ defined in (7) satisfies the following equality:

$\frac{\nabla f_i(x)^\top d}{\|\nabla f_i(x)\|_2} = \|d\|_2^2$

for all $i$ such that $\alpha_i^* > 0$. If $\alpha_i^* > 0$ for all $i$, the direction $d$ has the same angle with all individual gradients, i.e. this direction and the gradients are equiangular with some angle $\theta$ such that $\cos\theta = \|d\|_2$.
Proof
Let $I = \{ i : \alpha_i^* > 0 \}$ be the index set of the nonzero coefficients and denote $\hat{g}_i = \nabla f_i(x) / \|\nabla f_i(x)\|_2$. Then problem (8) can be rewritten over this index set as

$\min_{\alpha} \Big\| \sum_{i \in I} \alpha_i \hat{g}_i \Big\|_2^2 \quad \text{s.t.} \quad \sum_{i \in I} \alpha_i = 1, \quad \alpha_i > 0.$

The solution $\alpha^*$ uniquely defines $d$ in the following way:

(9) $d = \sum_{i \in I} \alpha_i^* \hat{g}_i.$

Now we can write the Lagrangian and derive the KKT optimality conditions for this optimization problem. Note that the constraint $\alpha_i > 0$ is used in the KKT conditions implicitly, since by construction there exists a solution that satisfies this constraint. Therefore, the Lagrangian is the function $L(\alpha, \lambda) = \big\| \sum_{i \in I} \alpha_i \hat{g}_i \big\|_2^2 - \lambda \big( \sum_{i \in I} \alpha_i - 1 \big)$. From the KKT optimality conditions it follows that the gradient of $L$ with respect to $\alpha$ has to be zero at $\alpha^*$, hence for $i \in I$:

$2 \hat{g}_i^\top d = \lambda,$

since only indices from the set $I$ correspond to nonzero elements of (9). Thus we get the equality for any index $i$ such that $\alpha_i^* > 0$:

(10) $\hat{g}_i^\top d = \frac{\lambda}{2}.$

Now we can compute the value of the remaining factor $\lambda$. Consider the following chain of equalities:

$\|d\|_2^2 = d^\top d = \sum_{i \in I} \alpha_i^* \hat{g}_i^\top d = \frac{\lambda}{2} \sum_{i \in I} \alpha_i^* = \frac{\lambda}{2}.$

Thus, the equality (10) can be rewritten in the final form:

$\frac{\nabla f_i(x)^\top d}{\|\nabla f_i(x)\|_2} = \|d\|_2^2,$

where $i$ is any index such that $\alpha_i^* > 0$.
2.1 Normalization of the equiangular direction
To define the direction $d$, only the angles between the gradients are important. But to define the proper norm of this direction, we need additional assumptions. If we have only two gradients $\nabla f_1$ and $\nabla f_2$ and their norms are equal, then it is natural to require that the normalized vector coincides with the average $(\nabla f_1 + \nabla f_2)/2$, i.e. the normalized vector has to belong to the convex hull of the gradients $\nabla f_1$ and $\nabla f_2$. To satisfy this requirement, the scale factor $\gamma$ and the corresponding vector $\hat{d}$ have the following forms:

(11) $\gamma = \Big( \sum_{i=1}^{m} \frac{\alpha_i^*}{\|\nabla f_i(x)\|_2} \Big)^{-1}, \quad \hat{d} = \gamma d,$

so that $\hat{d}$ is a convex combination of the raw gradients $\nabla f_i(x)$.

Remark 1

For $m = 2$ there is an explicit formula for the bisector direction:

$u = \frac{\nabla f_1(x)}{\|\nabla f_1(x)\|_2} + \frac{\nabla f_2(x)}{\|\nabla f_2(x)\|_2},$

which formally differs from the solution of problem (8) only by a normalization factor. Now, to get the normalized vector that belongs to the convex hull of the gradients, we use the scale factor $\gamma$ and obtain the final form of the vector $\hat{d}$:

(12) $\hat{d} = \frac{\|\nabla f_1(x)\|_2 \, \|\nabla f_2(x)\|_2}{\|\nabla f_1(x)\|_2 + \|\nabla f_2(x)\|_2} \, u.$
Note that if $u \neq 0$, the direction defined by (12) provides a guaranteed descent direction for both of the losses.
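A minimal sketch of the two-gradient case of (12) (our code; the scale factor is chosen, as described above, so that the result lies in the convex hull of the gradients and reduces to their average when the norms coincide):

```python
import numpy as np

def edm_direction_two(g1, g2):
    """Scaled bisector of g1 and g2:
    hat_d = (|g1||g2| / (|g1| + |g2|)) * (g1/|g1| + g2/|g2|),
    which simplifies to (|g2| g1 + |g1| g2) / (|g1| + |g2|),
    an explicit convex combination of g1 and g2."""
    n1, n2 = np.linalg.norm(g1), np.linalg.norm(g2)
    return (n2 * g1 + n1 * g2) / (n1 + n2)

# Equal norms: hat_d coincides with the average of the gradients.
d_equal = edm_direction_two(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Rescaling one gradient changes only the length of hat_d, not its
# (bisector) direction -- unlike the MGDA direction.
d_scaled = edm_direction_two(np.array([2.0, 0.0]), np.array([0.0, 1.0]))
```

Here `d_equal` is `[0.5, 0.5]` and `d_scaled` stays on the bisector, illustrating the scale robustness of the method.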
Theorem 2
The equiangular direction method converges to a Pareto-stationary point in a finite number of iterations, or, if the sequence $\{x_k\}$ is infinite, there exists a subsequence that converges to a Pareto-stationary point.
Proof
If the EDM converges after $K$ iterations to a point $x_K$ such that $\hat{d}_K = 0$, then there exist coefficients $\beta_i = \gamma \alpha_i^* / \|\nabla f_i(x_K)\|_2 \ge 0$ with $\sum_{i} \beta_i = 1$ such that $\sum_{i} \beta_i \nabla f_i(x_K) = 0$. Therefore, $x_K$ is a Pareto-stationary point. On the other hand, if the sequence is infinite, the proof is the same as the proof of Theorem 2.3 from [desideri2012multiple] on the convergence of MGDA.
The proposed method is summarized in Algorithm 1.
2.2 Frank-Wolfe method to find the coefficients
Since the feasible set in problem (8) is a simplex, this convex optimization problem can be efficiently solved by the Frank-Wolfe method [frank1956algorithm, jaggi2013revisiting]. Note that the objective function in problem (8) can be written as

(13) $f(\alpha) = \alpha^\top \hat{G}^\top \hat{G} \alpha,$

where $\hat{G}$ is the matrix whose $i$-th column is the normalized gradient $\hat{g}_i = \nabla f_i(x) / \|\nabla f_i(x)\|_2$. Then one iteration of the Frank-Wolfe method for solving problem (8) reduces to the following steps. The first step is computing the gradient of the objective function (13), $\nabla f(\alpha) = 2 \hat{G}^\top \hat{G} \alpha$, and finding the index $\hat{i}$ of its smallest element. The second step is computing the optimal step size by solving the auxiliary one-dimensional optimization problem:

(14) $\gamma^* = \arg\min_{\gamma \in [0, 1]} f\big(\alpha + \gamma (e_{\hat{i}} - \alpha)\big),$

where $e_{\hat{i}}$ is the $\hat{i}$-th basis vector. The solution of problem (14) can be written in the closed form:

(15) $\gamma^* = \min\left( 1, \max\left( 0, -\frac{(e_{\hat{i}} - \alpha)^\top \hat{G}^\top \hat{G} \alpha}{(e_{\hat{i}} - \alpha)^\top \hat{G}^\top \hat{G} (e_{\hat{i}} - \alpha)} \right) \right).$

The third step is updating the coefficients of the convex combination as

$\alpha \leftarrow \alpha + \gamma^* (e_{\hat{i}} - \alpha).$

For convenience, we provide the detailed description of the Frank-Wolfe method for solving problem (8) in Algorithm 2. Note that every iteration of this method can be interpreted from a geometric perspective. In particular, the selected basis vector $e_{\hat{i}}$ corresponds to the normalized gradient whose angle with the current direction is the largest among all normalized gradients, see line 4 in Algorithm 2.
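The steps above can be sketched in a few lines of NumPy (our implementation of the iteration, not the paper's Algorithm 2 verbatim; variable names are ours, and we assume the objective of (8) is the squared norm of the combination of normalized gradients):

```python
import numpy as np

def edm_coefficients(G, iters=200):
    """Frank-Wolfe iterations for problem (8) over the simplex.
    G has one gradient per column; gradients are normalized inside."""
    G_hat = G / np.linalg.norm(G, axis=0, keepdims=True)
    M = G_hat.T @ G_hat                      # Gram matrix of normalized gradients
    m = G.shape[1]
    alpha = np.full(m, 1.0 / m)              # start at the barycenter of the simplex
    for _ in range(iters):
        i = int(np.argmin(2.0 * M @ alpha))  # vertex minimizing the linearization
        step = np.zeros(m)
        step[i] = 1.0
        step -= alpha                        # direction e_i - alpha
        denom = step @ M @ step
        if denom <= 0.0:
            break
        gamma = np.clip(-(alpha @ M @ step) / denom, 0.0, 1.0)  # closed form (15)
        alpha += gamma * step
    return alpha

def edm_direction(G, iters=200):
    """Equiangular direction (7): convex combination of normalized gradients."""
    G_hat = G / np.linalg.norm(G, axis=0, keepdims=True)
    return G_hat @ edm_coefficients(G, iters)
```

At the solution, the resulting direction makes the same angle with every gradient that receives a nonzero weight, which can be verified numerically via the cosines of the respective angles.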
3 Computational experiment
In this section, we compare the proposed method with other approaches to solving the imbalanced classification problem and the multitask learning problem. As an example of the latter, we consider the classification problem on the MultiMNIST dataset, which is a modification of the standard MNIST dataset [lecun1998gradient] appropriate for multitask learning. To compute the gradients of the losses in the considered problems, we use the automatic differentiation technique implemented in the PyTorch framework [paszke2017automatic]. The source code can be found at GitHub: https://github.com/amkatrutsa/edm.

3.1 Imbalanced classification problem
Let $\mathcal{D} = \{(z_j, y_j)\}_{j=1}^{N}$ be the dataset with $K$ classes, i.e. $y_j \in \{1, \dots, K\}$. Consider a function $g(z; x)$ that estimates the label $y$ of a sample $z$. We need to find a parameter vector $x$ such that the classification quality is as high as possible. To measure the classification quality, a loss function is introduced and minimized with respect to $x$. This loss function is typically written as

(16) $L(x) = \sum_{k=1}^{K} L_k(x),$

where $L_k(x)$ is the loss corresponding to the $k$-th class:

$L_k(x) = \frac{1}{|I_k|} \sum_{j \in I_k} \ell\big(g(z_j; x), y_j\big),$

where $I_k$ is the set of indices such that if $j \in I_k$ then $y_j = k$, and $\ell$ is a loss for a given pair of prediction and label. We use the cross-entropy loss function and represent $g$ as a neural network. Therefore, the vector $x$ is composed of the stacked vectorized parameters of this neural network.

The classification problem is called imbalanced if there exists a class label $k^*$ such that $|I_{k^*}| \ll |I_k|$ for all $k \ne k^*$. In other words, the number of samples from the class $k^*$ is significantly smaller than the number of samples from the other classes. In the case of binary imbalanced classification, where $K = 2$, the class $k^*$ is called minor and the other class is called major. Further, we always refer to the label of the major class as $0$ and the label of the minor class as $1$. If one always assigns the label $0$ to any sample, the standard accuracy computed over all samples will be close to 1. However, this means that the samples with ground-truth label $1$ are always misclassified. To address this issue, a class weight is introduced to balance $L_0$ and $L_1$. Denote by $w$ the weight corresponding to the minor class. Then the total loss function can be rewritten in the form

(17) $L(x) = L_0(x) + w L_1(x).$
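A small sketch of the per-class losses and the weighted total (17) for the binary case (our code; `p` denotes the predicted probability of the minor class, a notational assumption):

```python
import numpy as np

def per_class_losses(p, y, eps=1e-12):
    """Cross-entropy averaged separately over the major (y = 0)
    and minor (y = 1) classes, giving L_0 and L_1."""
    ce = -(y * np.log(p + eps) + (1 - y) * np.log(1.0 - p + eps))
    return ce[y == 0].mean(), ce[y == 1].mean()

def weighted_total_loss(p, y, w):
    """Total loss (17): L = L_0 + w * L_1."""
    l0, l1 = per_class_losses(p, y)
    return l0 + w * l1
```

Tuning `w` trades off the two class accuracies; EDM instead treats $L_0$ and $L_1$ as two separate objectives.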
The higher the value of $w$, the higher the classification accuracy expected on the minor class. However, the hyperparameter $w$ has to be tuned to balance the accuracies of the major and minor classes. To avoid this tuning, EDM can be used to automatically balance $L_0$ and $L_1$ without introducing an additional hyperparameter.

To compare the considered methods on the imbalanced classification problem, we use the credit card transaction dataset [dal2017credit], where the minor class consists of fraud transactions. The total number of samples in this dataset is 284807, and the number of fraud transactions is only a small fraction of this total. Thus, we have an imbalanced binary classification problem.
To demonstrate how EDM adjusts the accuracies in the imbalanced classification problem, we compare EDM with the vanilla SGD method that minimizes the total loss (17) for different values of $w$. More advanced gradient-based methods still suffer from the necessity of tuning the hyperparameter $w$; therefore, we compare EDM with the vanilla SGD method only. We consider a simple neural network with two fully-connected layers and a ReLU activation between them. The entire dataset is split into train and test sets such that the proportion of the minor class is the same in both. The numbers of samples in the train and test sets are 227845 and 56962, respectively. Since the classes are imbalanced, we generate batches for them separately. The batch sizes are different, but during every epoch 40 batches of every class are used in the training process. We test several learning rates in the considered methods. Since a smaller learning rate requires a larger number of epochs, we use different numbers of epochs in training, chosen so that every considered method converges.

Table 1 presents the test accuracies separately for the minor and major classes. It shows that EDM gives a reasonable trade-off between the accuracies on such imbalanced classes without any tuning of a hyperparameter. Moreover, EDM is robust to a range of step sizes and preserves balanced accuracies for the individual classes. Also, EDM gives higher accuracy for the major class and slightly smaller or equal accuracy for the minor class compared with MGDA, see Tables 1(a) and 1(b). For one of the considered learning rates, EDM provides higher accuracies for both the major and minor classes than MGDA.



Table 1: Test accuracies for both classes given by the considered methods. The reported mean values and standard deviations are computed from three random initializations of the considered model.
3.2 Multitask learning problem
In this section, the standard classification problem is reduced to a multitask learning (MTL) problem following the study [sener2018multi]. To test the presented method on the MTL problem, we consider the MultiMNIST dataset and an adaptation of the LeNet neural network [lecun1989backpropagation]. The MultiMNIST dataset is a modification of the classical MNIST dataset [lecun1998gradient]. Every image from MultiMNIST is composed of two MNIST images: one image is placed in the top-left corner, and the other one is placed in the bottom-right corner [sabour2017dynamic, sener2018multi]. To make the overlaying consistent, we create a larger canvas, place digits from the original MNIST dataset in its opposite corners, and finally scale the result to the standard MNIST image size. Samples from MultiMNIST are shown in Figure 2.
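The construction can be sketched as follows (our code; the 36x36 canvas size and the nearest-neighbour downscaling are assumptions for illustration only, since the exact sizes and interpolation are not fixed here):

```python
import numpy as np

def compose_multimnist(digit_a, digit_b, canvas=36, out=28):
    """Place digit_a in the top-left and digit_b in the bottom-right corner
    of a larger canvas, then downscale back to the standard MNIST size."""
    big = np.zeros((canvas, canvas), dtype=float)
    h, w = digit_a.shape
    big[:h, :w] = digit_a
    # max keeps both digits visible where they overlap
    big[-h:, -w:] = np.maximum(big[-h:, -w:], digit_b)
    idx = np.arange(out) * canvas // out      # nearest-neighbour sampling grid
    return big[np.ix_(idx, idx)]
```

Each composed image carries two labels, one per original digit, which yields the two classification tasks below.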
Now we have an MTL problem, where the first task is to classify the image in the top-left corner, and the second task is to classify the image in the bottom-right corner. To solve this MTL problem, the following modification of the LeNet architecture [lecun1989backpropagation] presented in [sener2018multi] is used in this study, see Figure 3. Shared layers generate a representation of every image that is used to solve both tasks. Task-specific layers are responsible for solving a particular task: they use the representation constructed by the shared layers and do not affect the solution of the other task.

Following the work [sener2018multi], we update the parameters in the shared and task-specific layers differently. The parameters of the shared layers are updated with the multiobjective optimization methods, since these parameters affect the losses of both tasks simultaneously. The parameters of the task-specific layers are updated with the SGD method, since these parameters affect only single-task losses.
In the presented experiments, we illustrate the robustness of EDM to multiscale losses, in contrast to the MGDA method. Denote by $L_1$ and $L_2$ the cross-entropy losses for the first and the second tasks, respectively. To control the scale of the loss $L_2$, we introduce a hyperparameter $s$ that multiplies the loss $L_2$. In this setting, we minimize the losses $L_1$ and $s L_2$ concurrently. The larger $s$, the larger the second loss. In particular, if $s = 1$, the losses $L_1$ and $L_2$ are minimized as they are. Note that the magnitudes of both losses are approximately the same for the considered dataset and neural network. At the same time, if $s = 10$, the loss $L_2$ becomes ten times larger. We expect that in this case EDM ensures more robust training than MGDA and, consequently, higher test accuracy. Also, we compare EDM with the single-task approach [sener2018multi]. This approach is based on two identical task-specific neural networks such that every neural network solves a particular task, i.e., classification of the top-left or the bottom-right image. The architecture of these neural networks coincides with the LeNet modification in Figure 3, but includes only one block of task-specific layers, since every network solves only one task. The single-task neural networks are trained with the vanilla SGD method for a fair comparison with EDM and MGDA.
Table 2 presents the test accuracies obtained by the considered methods for $s = 1$ and $s = 10$. In this experiment we use a fixed learning rate, batch size, and number of epochs for all methods. We show in Table 2(a) that even for $s = 1$, EDM is more accurate than or equally accurate to MGDA and the single-task approach in both tasks. The desired property of a multiobjective optimization method, robustness to different scales of the individual losses, is clearer from Table 2(b). This table corresponds to the setting where the loss $L_2$ is multiplied by the factor $s = 10$. The test accuracies in both tasks given by EDM are significantly higher than the test accuracies corresponding to MGDA and the single-task approach. Naturally, the single-task approach gives the same test accuracy for the top-left task for both values of $s$, since the corresponding loss is unchanged.


4 Related works
Multiobjective optimization problems come from many applications where the target vector is evaluated by more than one loss function. Examples of such applications are computer networks [donoso2016multi], energy saving [cui2017multi], engineering [chiandussi2012comparison, marler2004survey], etc. In machine learning and deep learning, models are typically trained by minimizing some predefined loss function computed on the training dataset. However, the quality of the trained model is additionally evaluated according to external criteria. For example, the coefficient of determination in regression problems estimates the ratio of the dependent variable variance that is explained by the trained model [helland1987interpretation]. In classification problems, an AUC score close to one indicates high quality of the trained model [vanderlooy2008critical]. In deep learning, neural networks for image classification can be evaluated on their robustness against adversarial attacks [yuan2019adversarial, tursynbek2020geometry]. Also, recently proposed neural ODE models can be compared not only by the primal quality measure but also by the smoothness of the trained dynamics [gusak2020towards]. Since most of the external criteria used to evaluate the quality of models depend on discrete variables, gradient-free methods have been proposed to solve multiobjective discontinuous optimization problems [deb2001multi]. Among the most common approaches are genetic algorithms [deb2002fast, von2014survey, abraham2005evolutionary], particle swarm optimization methods [wang2009particle], and other nature-inspired heuristics [coello2009advances, omkar2011artificial]. These methods randomly explore the search space to find Pareto-optimal points and mostly suffer from the absence of any guarantees of convergence to a Pareto-optimal point.

We focus on unconstrained multiobjective optimization problems where the loss functions are differentiable. Methods to solve such problems can use the gradients of the individual losses. Besides the weighting sum method that is typically used in applications [liu2016gradient], modifications of standard methods for single-objective optimization problems have been proposed. For example, the steepest descent method for multiobjective optimization problems is proposed in [fliege2000steepest]. This method requires solving an auxiliary optimization problem in every iteration to get a descent direction, which is similar to MGDA. An extension of this approach to problems with box constraints is presented in [miglierina2008box], which also interprets multiobjective optimization problems from the perspective of dynamical systems theory. Further, a proximal method to solve multiobjective optimization problems is proposed in [bonnel2005proximal], but without any numerical comparison with other methods. One more approach to solving multiobjective problems is the generalized homotopy approach [hillermeier2001generalized]. It represents the Pareto-optimal points as a differentiable manifold and generates new Pareto-optimal points through numerical evaluation of a local chart of this manifold.
5 Conclusion
This study considers multiobjective optimization problems where the loss functions are of different scales. To solve problems with this property, we propose the Equiangular Direction Method (EDM) and prove that it guarantees an equal relative decrease of every loss function. Thus, EDM is robust to multiscale losses. We illustrate the performance of EDM on the imbalanced classification and multitask learning problems. The proposed method provides the highest test accuracy compared with other approaches to the considered problems.