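1 Introduction
A multiobjective optimization problem (MOP) can be defined as follows:
$$\min_{x} \; F(x) = (f_1(x), f_2(x), \dots, f_m(x))^{T} \quad \text{subject to } x \in \Omega, \tag{1}$$
where $\Omega$ is the decision space, $F: \Omega \rightarrow \mathbb{R}^m$ is composed of $m$ real-valued objective functions, $\mathbb{R}^m$ is called the objective space, and $f_i(x)$ for $i = 1, \dots, m$ is the $i$-th objective of the MOP. Since the objectives of an MOP are usually conflicting, it is generally impossible to find a single solution that optimizes all objectives at the same time. Thus a trade-off is required among the objectives.
Let $u, v \in \mathbb{R}^m$; $u$ is said to dominate $v$ if and only if $u_i \le v_i$ for every $i \in \{1, \dots, m\}$ and $u_j < v_j$ for at least one index $j \in \{1, \dots, m\}$. A solution $x^* \in \Omega$ is called a Pareto optimal solution if there is no solution $x \in \Omega$ such that $F(x)$ dominates $F(x^*)$. The set of all Pareto optimal solutions is called the Pareto set (PS), and the set $\{F(x) \mid x \in PS\}$ is called the Pareto front (PF).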
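The dominance relation and the notion of a non-dominated set can be illustrated with a minimal Python sketch. It assumes that all objectives are minimized; the helper names `dominates` and `non_dominated` and the sample objective vectors are hypothetical and serve only as an illustration.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if objective vector u dominates v (minimization)."""
    no_worse = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors of four candidate solutions.
objs = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = non_dominated(objs)  # (3.0, 3.0) is dominated by (2.0, 2.0)
```

The filter is quadratic in the number of points, which is adequate for a front built from a modest number of subproblem solutions.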
Many MOPs are NP-hard, such as the multiobjective travelling salesman problem (MOTSP) and the multiobjective vehicle routing problem. It is often difficult to find the PF of an MOP using exact algorithms. There are mainly two categories of optimization algorithms for solving MOPs. The first category is heuristics, such as NSGA-II [5] and MOEA/D [22]. The second category is learning-based heuristic methods. Heuristics are often used to solve MOPs [18, 19, 3, 21], but they have several drawbacks. Firstly, it is time-consuming for heuristics to approximate the PF of an MOP. Secondly, once the problem changes slightly, the heuristic may need to be run again to compute the solutions. Moreover, as problem-specific methods, heuristics often need to be revised for different problems, even for similar ones.
Instead of designing specific heuristics, deep reinforcement learning (DRL) learns heuristics directly from data with an end-to-end neural network. Taking the travelling salesman problem (TSP) as an example, given $n$ cities as input, the aim is to find a sequence of these cities with minimum tour length. DRL views the problem as a Markov decision process. The TSP can then be formulated as follows: the state is defined by the features of the partial solution and the unvisited cities, the action is the selection of the next city, the reward is the negative tour length of a solution, and the policy is the heuristic that learns how to make decisions, which is parameterized by a neural network. The aim of DRL is to train a policy that maximizes the reward. Once a policy is trained, a solution can be generated directly from one feed-forward pass of the trained neural network. Because instances from the same distribution do not have to be solved repeatedly from scratch, DRL is more efficient and requires much less problem-specific expert knowledge than heuristics.
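The Markov decision view of the TSP described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and `greedy_policy` is a hand-written nearest-neighbour rule standing in for the trained neural policy, which is not reproduced here.

```python
import math


def tour_length(cities, tour):
    """Length of the closed tour that visits the cities in the given order."""
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))


def greedy_policy(cities, tour, unvisited):
    """Stand-in policy: go to the nearest unvisited city (ties -> lowest index)."""
    if not tour:
        return min(unvisited)
    last = cities[tour[-1]]
    return min(sorted(unvisited), key=lambda j: math.dist(last, cities[j]))


def rollout(cities, policy):
    """Build a tour step by step; the state is (partial tour, unvisited set)."""
    tour, unvisited = [], set(range(len(cities)))
    while unvisited:
        action = policy(cities, tour, unvisited)  # action = next city
        tour.append(action)
        unvisited.remove(action)
    reward = -tour_length(cities, tour)  # reward = negative tour length
    return tour, reward


# Four cities on the corners of the unit square (hypothetical instance).
cities = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
tour, reward = rollout(cities, greedy_policy)
```

A learned policy would replace `greedy_policy` with a network that outputs a probability distribution over the unvisited cities at each step.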
Inspired by MOEA/D and recently proposed DRL methods, a deep reinforcement learning multiobjective optimization algorithm (DRL-MOA) [11] has been proposed to learn heuristics for solving MOPs. In DRL-MOA, MOTSP is first decomposed into a set of single-objective optimization subproblems. Then modified pointer networks, one per subproblem and each similar to the pointer network, are used to model these subproblems. Finally, these models are trained sequentially by the REINFORCE algorithm [20]. The experimental results on MOTSP in [11] show that DRL-MOA achieves better performance than NSGA-II and MOEA/D.
MOTSP is defined on a graph in which every node carries not only its own features but also graph structure features, such as its distances from other nodes. In DRL-MOA, the modified pointer network that models the subproblems of MOTSP does not consider these graph structure features. Therefore, this paper proposes a multiobjective deep reinforcement learning algorithm using decomposition and attention model (MODRL/D-AM) to solve MOPs. The attention model can extract the node features as well as the graph structure features of MOP instances, which is helpful in making decisions. To show the effectiveness of our method, MODRL/D-AM is compared with DRL-MOA on MOTSP, and a significant improvement is observed in overall performance in terms of convergence and diversity.
The remainder of our paper is organized as follows. In Section 2, DRL-MOA is described. MODRL/D-AM is introduced in Section 3. Experiment results and analysis are presented in Section 4. Finally, conclusions are given in Section 5.
2 Brief Review of DRL-MOA for MOTSP
2.1 Problem Formulation and Framework
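We focus on MOTSP in this paper. Given $n$ cities and $m$ objective functions, the $i$-th objective function of MOTSP is formulated as follows [12]:
$$f_i(\pi) = \sum_{j=1}^{n-1} c^{i}_{\pi(j), \pi(j+1)} + c^{i}_{\pi(n), \pi(1)}, \quad i = 1, \dots, m, \tag{2}$$
where the route $\pi$ is a permutation of the $n$ cities and $c^{i}_{\pi(j), \pi(j+1)}$ is the $i$-th cost of travelling from city $\pi(j)$ to city $\pi(j+1)$. The goal of MOTSP is to find a set of routes that minimize the $m$ objective functions simultaneously.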
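As a small worked example of the objective above, the following sketch evaluates all objectives of a closed route from per-objective cost matrices. The function name `motsp_objectives` and the cost values are hypothetical.

```python
def motsp_objectives(costs, route):
    """Return [f_1, ..., f_m] for a closed route.

    costs[i][a][b] is the i-th cost of travelling from city a to city b.
    """
    n = len(route)
    return [sum(c[route[j]][route[(j + 1) % n]] for j in range(n))
            for c in costs]


# Hypothetical bi-objective instance with three cities.
cost_1 = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
cost_2 = [[0, 5, 1], [5, 0, 1], [1, 1, 0]]
values = motsp_objectives([cost_1, cost_2], route=[0, 1, 2])
```

The modulo index closes the tour, which corresponds to the return cost from the last city to the first one.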
Like MOEA/D, DRL-MOA decomposes MOTSP into $N$ scalar optimization subproblems by the well-known weighted sum approach, which considers a weighted combination of the different objectives. Let $\lambda^{j} = (\lambda^{j}_{1}, \dots, \lambda^{j}_{m})^{T}$, where $\lambda^{j}_{i} \ge 0$ and $\sum_{i=1}^{m} \lambda^{j}_{i} = 1$, be the weight vector corresponding to the $j$-th scalar optimization subproblem of MOTSP, which is defined as follows:
$$\min_{\pi} \; g^{ws}(\pi \mid \lambda^{j}) = \sum_{i=1}^{m} \lambda^{j}_{i} f_i(\pi). \tag{3}$$
The optimal solution of the scalar optimization problem above is a Pareto optimal solution. Then, let $\{\lambda^{1}, \dots, \lambda^{N}\}$ be a set of weight vectors, where each weight vector corresponds to a scalar optimization subproblem. When $m = 2$, the weight vectors and corresponding subproblems are spread uniformly as in Fig. 1 (a). The PS is made up of the non-dominated solutions of all subproblems.
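The weighted sum decomposition can be sketched for the bi-objective case as follows. This is a minimal illustration, assuming evenly spaced weight vectors; the helper names and the candidate objective vectors are hypothetical.

```python
def uniform_weight_vectors(num_subproblems):
    """Evenly spread bi-objective weight vectors whose entries sum to 1."""
    if num_subproblems < 2:
        raise ValueError("need at least two subproblems")
    step = 1.0 / (num_subproblems - 1)
    return [(j * step, 1.0 - j * step) for j in range(num_subproblems)]


def weighted_sum(objectives, weights):
    """Scalarize an objective vector with one weight vector."""
    return sum(w * f for w, f in zip(weights, objectives))


# Hypothetical objective vectors of three candidate routes.
candidates = {"route_a": (6.0, 7.0), "route_b": (4.0, 10.0), "route_c": (9.0, 3.0)}
weights = uniform_weight_vectors(5)
# Best candidate for each scalar subproblem.
best = [min(candidates, key=lambda r: weighted_sum(candidates[r], w))
        for w in weights]
```

Each weight vector selects the candidate that is best for its own trade-off, so different subproblems can return different points of the front.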
After decomposing MOTSP into a set of scalar optimization subproblems, each subproblem can be modelled by a neural network and solved by DRL methods. However, training the models requires a huge amount of time. Thus, to decrease the training time, DRL-MOA adopts a neighborhood-based transfer strategy, which is shown in Fig. 1 (b). Each model corresponds to a subproblem. When one subproblem is solved, the parameters of the corresponding model are transferred to the model of the neighboring subproblem, so that the neighboring subproblem can be solved quickly. By making use of the neighborhood information among subproblems, all subproblems are tackled sequentially and quickly. The basic idea of DRL-MOA is shown in Algorithm 1. The subproblems are solved sequentially, and the models are trained with the REINFORCE algorithm by combining DRL with the neighborhood-based transfer strategy. Finally, the PF can be approximated by a simple feed-forward calculation of the models.
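The neighborhood-based transfer strategy amounts to warm-starting each model from the parameters of the previously solved neighboring subproblem. The toy sketch below illustrates only this control flow and makes strong simplifying assumptions: the "model" is a single float, and `train` is a hypothetical stand-in that runs gradient descent on a quadratic loss instead of DRL training.

```python
def train(params, weight, epochs, lr=0.2):
    """Toy stand-in for training one subproblem model.

    The 'model' is one float and the loss is (params - weight[0]) ** 2,
    so neighbouring weight vectors have neighbouring optima.
    """
    target = weight[0]
    for _ in range(epochs):
        params -= lr * 2.0 * (params - target)  # one gradient-descent step
    return params


def solve_sequentially(weights, first_epochs, later_epochs):
    """Train one model per subproblem, warm-starting from the previous one."""
    models = []
    params = 0.0  # stand-in for a random initialisation
    for j, weight in enumerate(weights):
        epochs = first_epochs if j == 0 else later_epochs
        params = train(params, weight, epochs)  # transfer from the neighbour
        models.append(params)
    return models


weights = [(j / 4, 1 - j / 4) for j in range(5)]
models = solve_sequentially(weights, first_epochs=5, later_epochs=1)
cold_start = train(0.0, weights[-1], epochs=1)  # same budget, no transfer
```

In this toy setting the warm-started last model ends closer to its optimum than a model trained from scratch with the same one-epoch budget, which is the effect the transfer strategy relies on.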
2.2 Model of Subproblem: Pointer Network
A subproblem instance of MOTSP can be defined on a graph with $n$ nodes, which is denoted by a set $X = \{x_1, \dots, x_n\}$. Each node has a feature vector $x_i$, whose components correspond to the different objectives of MOTSP. For example, a widely used feature is the 2-dimensional coordinate in Euclidean space. The solution, denoted by $\pi = (\pi_1, \dots, \pi_n)$, is a permutation of the graph nodes of MOTSP. The objective is to minimize the weighted sum of the different objectives as in Eq. (3). The process of generating a solution can be viewed as a sequential decision process, so each subproblem can be solved by an encoder-decoder model [4] parameterized by $\theta$. Firstly, the encoder maps the node features to node embeddings in a high-dimensional vector space. Then the decoder generates the solution step by step. At each decoding step $t$, one node $\pi_t$ that has not been visited is selected, so the probability of a solution factorizes by the chain rule:
$$p_{\theta}(\pi \mid X) = \prod_{t=1}^{n} p_{\theta}(\pi_t \mid \pi_{1:t-1}, X). \tag{4}$$
In DRL-MOA, a modified pointer network is used to compute the probability in Eq. (4). The encoder of the modified pointer network transforms each node feature to an embedding in a high-dimensional vector space through a 1-dimensional (1-D) convolution layer. At each decoding step $t$, a gated recurrent unit (GRU) [4] and a variant of the attention mechanism [1] are used to produce a probability distribution over the unvisited nodes, which is used to select the next node to visit. More details of the modified pointer network can be found in [11].
3 The Proposed Algorithm: MODRL/D-AM
3.1 Motivation
In DRL-MOA, a modified pointer network is used to model the subproblem of MOTSP. In the modified pointer network, the encoder extracts the node features using a simple 1-D convolutional layer. However, each subproblem of MOTSP is defined over a fully connected graph (with self-connections), and such a simple encoder cannot exploit the graph structure of a problem instance. At decoding step $t$, the decoder uses a GRU to map the partial tour $\pi_{1:t-1}$ to a hidden state, which is used as the decoding context to calculate the probability distribution for selecting the next node. However, the partial tour cannot be changed, and the goal is to construct a path from $\pi_{t-1}$ to $\pi_1$ through all unvisited nodes. In other words, the selection of the next node is relevant only to the first and last nodes of the partial tour. Using a GRU to map the whole partial tour to a hidden state may therefore not be helpful in selecting the next node, since the hidden state contains much irrelevant information. Thus, this paper uses the attention model [10], instead of the pointer network, to model the subproblem.
3.2 Model of Subproblem: Attention Model
The attention model is also an encoder-decoder model. However, unlike the modified pointer network, the encoder of the attention model can be viewed as a graph attention network [16], which is used to compute the embedding of each node. As shown in Fig. 2 (a), by attending over other nodes, the embedding of each node contains the node features as well as the structure features. The decoder of the attention model does not use a GRU to summarize the whole partial tour into a decoding context vector. Instead, the decoding context vector is calculated from the graph embedding and the embeddings of the first and last nodes of the partial tour, which is more useful in selecting the next node. The details of the attention model are described below.
3.2.1 Encoder of Attention Model
The encoder of the attention model transforms each node feature vector in the $d_x$-dimensional vector space to a node embedding in the $d_h$-dimensional vector space. The encoder consists of a linear transformation layer and $L$ attention layers, which is similar to the encoder used in the Transformer architecture [15]. However, the encoder of the attention model does not use positional encoding, since the input order is not meaningful. For each node $x_i$, where $i \in \{1, \dots, n\}$, the linear transformation layer with parameters $W$ and $b$ transforms the node feature vector $x_i$ to the initial node embedding $h_i^{(0)}$:
$$h_i^{(0)} = W x_i + b.$$
Then the node embeddings are fed into $L$ attention layers. Each attention layer contains a multi-head attention sublayer and a feed-forward sublayer. For each sublayer, a batch normalization layer [8] and a skip connection [7] are used to accelerate the training process.
Multi-Head Attention Sublayer
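For each node $x_i$, this sublayer is used to aggregate different types of messages from other nodes in the graph. Let the embedding of each node in layer $\ell$ be $h_i^{(\ell)}$, where $\ell \in \{1, \dots, L\}$ and $i \in \{1, \dots, n\}$. The output of the multi-head attention sublayer can be computed as follows:
$$\hat{h}_i^{(\ell)} = \mathrm{BN}\big(h_i^{(\ell-1)} + \mathrm{MHA}_i^{(\ell)}(h_1^{(\ell-1)}, \dots, h_n^{(\ell-1)})\big),$$
where BN is the batch normalization layer and $\mathrm{MHA}_i^{(\ell)}$ is the multi-head attention vector that contains different types of messages from other nodes. The number of heads is set to $M$. For each head $r \in \{1, \dots, M\}$, the query vector $q_{i,r}$, the key vector $k_{i,r}$ and the value vector $v_{i,r}$ are calculated by linear transformations of the node embedding $h_i^{(\ell-1)}$ for each node ($i \in \{1, \dots, n\}$). Then the process of computing the multi-head attention vector is described as follows:
$$q_{i,r} = W^{Q}_{r} h_i^{(\ell-1)}, \quad k_{i,r} = W^{K}_{r} h_i^{(\ell-1)}, \quad v_{i,r} = W^{V}_{r} h_i^{(\ell-1)},$$
$$u_{ij,r} = \frac{q_{i,r}^{T} k_{j,r}}{\sqrt{d_k}}, \qquad a_{ij,r} = \frac{\exp(u_{ij,r})}{\sum_{j'=1}^{n} \exp(u_{ij',r})},$$
$$h'_{i,r} = \sum_{j=1}^{n} a_{ij,r} v_{j,r}, \qquad \mathrm{MHA}_i^{(\ell)} = \sum_{r=1}^{M} W^{O}_{r} h'_{i,r},$$
where $W^{Q}_{r}, W^{K}_{r} \in \mathbb{R}^{d_k \times d_h}$, $W^{V}_{r} \in \mathbb{R}^{d_v \times d_h}$ and $W^{O}_{r} \in \mathbb{R}^{d_h \times d_v}$ are trainable attention weights of the $\ell$-th multi-head attention sublayer. $u_{ij,r}$ is the compatibility of the query vector of node $x_i$ with the key vector of node $x_j$, and the attention weight $a_{ij,r}$ is calculated using a softmax function. $h'_{i,r}$ is the combination of messages from other nodes received by node $x_i$. The multi-head attention vector is computed with $h'_{i,r}$ and $W^{O}_{r}$.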
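The message aggregation performed by the multi-head attention sublayer can be sketched in plain Python. This is an illustrative sketch of standard scaled dot-product multi-head attention, not the authors' implementation: the skip connection and batch normalization are omitted, the helper names are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(H, heads):
    """H holds n node embeddings; each head has 'Q', 'K', 'V', 'O' matrices."""
    n, d_h = len(H), len(H[0])
    out = [[0.0] * d_h for _ in range(n)]
    for head in heads:
        Q = [matvec(head["Q"], h) for h in H]
        K = [matvec(head["K"], h) for h in H]
        V = [matvec(head["V"], h) for h in H]
        d_k, d_v = len(Q[0]), len(V[0])
        for i in range(n):
            # Compatibility of node i's query with every node's key.
            u = [sum(a * b for a, b in zip(Q[i], K[j])) / math.sqrt(d_k)
                 for j in range(n)]
            a = softmax(u)
            # Weighted combination of value vectors (the received message).
            msg = [sum(a[j] * V[j][d] for j in range(n)) for d in range(d_v)]
            proj = matvec(head["O"], msg)
            out[i] = [o + p for o, p in zip(out[i], proj)]
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_h, d_k, n_heads, n_nodes = 6, 3, 2, 4  # small hypothetical sizes
heads = [{"Q": rand_matrix(d_k, d_h, rng), "K": rand_matrix(d_k, d_h, rng),
          "V": rand_matrix(d_k, d_h, rng), "O": rand_matrix(d_h, d_k, rng)}
         for _ in range(n_heads)]
H = rand_matrix(n_nodes, d_h, rng)
messages = multi_head_attention(H, heads)
```

Because no positional information is used, permuting the input nodes simply permutes the outputs, which matches the remark that the input order is not meaningful.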
Feed Forward Sublayer
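In this sublayer, the embedding of each node is updated by making use of the output of the multi-head attention sublayer. The feed-forward sublayer (FF) consists of a fully connected layer with a ReLU activation function followed by another fully connected layer. For each node $x_i$, the input of the feed-forward sublayer is the output $\hat{h}_i^{(\ell)}$ of the multi-head attention sublayer, and the output is calculated as follows:
$$h_i^{(\ell)} = \mathrm{BN}\big(\hat{h}_i^{(\ell)} + \mathrm{FF}(\hat{h}_i^{(\ell)})\big), \qquad \mathrm{FF}(\hat{h}_i^{(\ell)}) = W_1 \, \mathrm{ReLU}(W_0 \hat{h}_i^{(\ell)} + b_0) + b_1,$$
where $W_0$, $W_1$, $b_0$ and $b_1$ are trainable parameters.
For each node $x_i$, the final node embedding $h_i^{(L)}$ is calculated by the $L$ attention layers. In addition, the graph embedding $\bar{h}$ is defined as follows:
$$\bar{h} = \frac{1}{n} \sum_{i=1}^{n} h_i^{(L)}.$$
Both the node embeddings and the graph embedding are passed to the decoder.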
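The feed-forward sublayer and the graph embedding can be illustrated with a small sketch. It assumes that the graph embedding is the mean of the final node embeddings and leaves out batch normalization for brevity; the function names and the tiny hand-picked weights are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def feed_forward(h, W0, b0, W1, b1):
    """Fully connected layer with ReLU, followed by a second linear layer."""
    hidden = [max(0.0, z + b) for z, b in zip(matvec(W0, h), b0)]
    return [z + b for z, b in zip(matvec(W1, hidden), b1)]


def ff_sublayer(H_hat, W0, b0, W1, b1):
    """Apply FF to every node and add the skip connection (no batch norm)."""
    return [[x + y for x, y in zip(h, feed_forward(h, W0, b0, W1, b1))]
            for h in H_hat]


def graph_embedding(H):
    """Mean of the node embeddings (assumed form of the graph embedding)."""
    n = len(H)
    return [sum(h[d] for h in H) / n for d in range(len(H[0]))]


# Tiny hypothetical example: embedding size 2, hidden size 3.
W0 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b0 = [0.0, 0.0, 0.0]
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b1 = [0.5, 0.5]
H_hat = [[1.0, 2.0], [3.0, 4.0]]
H_out = ff_sublayer(H_hat, W0, b0, W1, b1)
h_bar = graph_embedding(H_out)
```

The same weights are applied to every node independently, so this sublayer mixes information within a node embedding but not between nodes.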
3.2.2 Decoder of Attention Model
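At each decoding step $t$, the decoder needs to decide $\pi_t$ based on the partial tour $\pi_{1:t-1}$ and the embeddings of each node and of the whole graph. Firstly, the initial context embedding $h_c$ is calculated by concatenating the graph embedding $\bar{h}$, the embedding of the first node $h_{\pi_1}$ and the embedding of the last node $h_{\pi_{t-1}}$. When $t = 1$, $h_{\pi_1}$ and $h_{\pi_{t-1}}$ are replaced by two trainable parameter vectors $v^{f}$ and $v^{l}$:
$$h_c = \begin{cases} [\bar{h}, h_{\pi_1}, h_{\pi_{t-1}}], & t > 1, \\ [\bar{h}, v^{f}, v^{l}], & t = 1. \end{cases}$$
Then a new context embedding $h_c'$ is computed with an $M$-head attention layer. The query vector comes from the context embedding $h_c$. For each node $x_i$, the key vector and the value vector are transformed from the node embedding $h_i$ (head indices are omitted for brevity):
$$q_c = W^{Q}_{c} h_c, \qquad k_i = W^{K}_{c} h_i, \qquad v_i = W^{V}_{c} h_i,$$
where $W^{Q}_{c} \in \mathbb{R}^{d_k \times 3 d_h}$ and $W^{K}_{c}, W^{V}_{c} \in \mathbb{R}^{d_k \times d_h}$. Then the compatibilities of the query vector with all nodes are computed. Unlike in the encoder of the attention model, the nodes that have been visited are masked when calculating the compatibilities:
$$u_{c,j} = \begin{cases} \dfrac{q_c^{T} k_j}{\sqrt{d_k}}, & j \notin \pi_{1:t-1}, \\ -\infty, & \text{otherwise}. \end{cases}$$
Then the attention weights can be obtained by a softmax function and the new context embedding can be calculated as follows:
$$a_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{n} \exp(u_{c,j'})}, \qquad h_c' = \sum_{r=1}^{M} W^{O}_{c,r} \sum_{j=1}^{n} a_{c,j}^{(r)} v_j^{(r)},$$
where the superscript $(r)$ denotes the quantities of head $r$. Finally, based on the new context embedding $h_c'$, the probability of selecting node $x_j$ as the next node to visit is calculated by a single-head attention layer:
$$u_j = \begin{cases} C \cdot \tanh\!\left(\dfrac{(W^{Q} h_c')^{T} (W^{K} h_j)}{\sqrt{d_h}}\right), & j \notin \pi_{1:t-1}, \\ -\infty, & \text{otherwise}, \end{cases} \tag{19}$$
$$p_{\theta}(\pi_t = j \mid \pi_{1:t-1}, X) = \frac{\exp(u_j)}{\sum_{j'=1}^{n} \exp(u_{j'})},$$
where $W^{Q}$ and $W^{K}$ are trainable parameters. When the compatibilities in Eq. (19) are computed, the results are limited to $(-C, C)$ by a tanh function.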
The decoding process at decoding step $t$ is shown in Fig. 2 (b). Firstly, the context embedding is computed with a multi-head attention layer by making use of the partial solution and the unvisited nodes. Then, based on the context embedding and the unvisited nodes, the probability distribution over the unvisited nodes is calculated by a single-head attention mechanism.
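One decoding step can be sketched as follows. This is an illustration under several assumptions, not the authors' code: the glimpse uses a single head instead of several, the partial tour is assumed to be non-empty (the trainable placeholders for the first step are left out), the clipping constant `clip=10.0` is a hypothetical choice, and all weights are random.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


def masked_softmax(scores, visited):
    """Softmax over unvisited nodes; visited nodes get probability 0."""
    top = max(s for s, m in zip(scores, visited) if not m)
    exps = [0.0 if m else math.exp(s - top) for s, m in zip(scores, visited)]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(H, h_graph, tour, params, clip=10.0):
    """Return the probability of visiting each node next."""
    n = len(H)
    visited = [j in tour for j in range(n)]
    # Context: graph embedding + first node + last node of the partial tour.
    context = h_graph + H[tour[0]] + H[tour[-1]]
    # Glimpse (single head here): masked attention over the nodes.
    q = matvec(params["Wq_glimpse"], context)
    keys = [matvec(params["Wk_glimpse"], h) for h in H]
    values = [matvec(params["Wv_glimpse"], h) for h in H]
    attn = masked_softmax([dot(q, k) / math.sqrt(len(q)) for k in keys], visited)
    new_context = [sum(attn[j] * values[j][d] for j in range(n))
                   for d in range(len(values[0]))]
    # Final single-head layer with tanh clipping of the compatibilities.
    q_out = matvec(params["Wq_out"], new_context)
    keys_out = [matvec(params["Wk_out"], h) for h in H]
    logits = [clip * math.tanh(dot(q_out, k) / math.sqrt(len(q_out)))
              for k in keys_out]
    return masked_softmax(logits, visited)


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(1)
d = 4  # hypothetical embedding size
H = rand_matrix(5, d, rng)
h_graph = [sum(h[k] for h in H) / len(H) for k in range(d)]
params = {"Wq_glimpse": rand_matrix(d, 3 * d, rng),
          "Wk_glimpse": rand_matrix(d, d, rng),
          "Wv_glimpse": rand_matrix(d, d, rng),
          "Wq_out": rand_matrix(d, d, rng),
          "Wk_out": rand_matrix(d, d, rng)}
probs = decode_step(H, h_graph, tour=[0, 2], params=params)
```

Sampling or taking the argmax of `probs` gives the next node, and repeating the step until every node is visited produces a complete tour.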
3.3 Framework and Training Method
The proposed algorithm uses the same MOEA/D framework as in DRL-MOA (Algorithm 1). The training method is briefly described as follows.
The REINFORCE algorithm, a well-known policy gradient method, is used here in an actor-critic form to train the model of each subproblem. For each subproblem, the trainable parameters consist of an actor network and a critic network. The actor network is the attention model, which is parameterized by $\theta$. The critic network, parameterized by $\phi$, has four 1-D convolutional layers that map the embeddings of a problem instance to a single value. The output of the critic network is an estimate of the objective function value of the subproblem.
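For the actor network, the training objective is the weighted sum of the different objectives of the solution $\pi$ of a problem instance $X$. The gradients of the parameters $\theta$ can therefore be defined as follows:
$$\nabla_{\theta} J(\theta \mid X) = \mathbb{E}_{\pi \sim p_{\theta}(\cdot \mid X)} \Big[ \big(g^{ws}(\pi \mid \lambda^{j}) - b_{\phi}(X)\big) \nabla_{\theta} \log p_{\theta}(\pi \mid X) \Big],$$
where $g^{ws}(\pi \mid \lambda^{j})$, evaluated on instance $X$, is the objective function of the $j$-th subproblem, which is the weighted sum of the different objectives, and $\lambda^{j}$ is the corresponding weight vector. $b_{\phi}(X)$ is a baseline function calculated by the critic network, which estimates the expected objective value to reduce the variance of the gradients.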
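In the training process, the MOTSP instances are generated from distributions $\{\Phi_{1}, \dots, \Phi_{m}\}$, since for each node of an instance $X$, different features may come from different distributions. For example, a feature can be a two-dimensional coordinate in Euclidean space and its distribution can be a uniform distribution over $[0, 1] \times [0, 1]$. Then the gradients of the parameters $\theta$ can be approximated by Monte Carlo sampling as follows:
$$\nabla_{\theta} J(\theta) \approx \frac{1}{B} \sum_{k=1}^{B} \Big[ \big(g^{ws}(\pi_k \mid \lambda^{j}) - b_{\phi}(X_k)\big) \nabla_{\theta} \log p_{\theta}(\pi_k \mid X_k) \Big],$$
where $B$ is the batch size, $X_k$ is a problem instance sampled from the distributions above, and $\pi_k$, generated by the actor network, is the solution of $X_k$.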
Unlike the actor network, the critic network aims to learn to estimate the expected objective value given an instance $X$. Hence, the objective function of the critic network can be a mean squared error between the objective value estimated by the critic network and the actual objective value of the solution generated by the actor network. The objective function of the critic network is formulated as follows:
$$L(\phi) = \frac{1}{B} \sum_{k=1}^{B} \big(b_{\phi}(X_k) - g^{ws}(\pi_k \mid \lambda^{j})\big)^{2}.$$
The training algorithm can be described in Algorithm 2.
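The two quantities that drive one training step, the baseline-corrected policy gradient estimate and the critic's mean squared error, can be computed as in the sketch below. It is a minimal illustration: the gradients of the log-probabilities are supplied as plain lists instead of coming from automatic differentiation, and the function names and numbers are hypothetical.

```python
def actor_gradient(costs, baselines, grad_log_probs):
    """Monte Carlo estimate: mean of (cost - baseline) * grad log p."""
    batch = len(costs)
    dim = len(grad_log_probs[0])
    grad = [0.0] * dim
    for cost, base, g in zip(costs, baselines, grad_log_probs):
        advantage = cost - base  # how much worse than the critic expected
        for d in range(dim):
            grad[d] += advantage * g[d] / batch
    return grad


def critic_loss(costs, baselines):
    """Mean squared error between critic estimates and sampled costs."""
    return sum((b - c) ** 2 for c, b in zip(costs, baselines)) / len(costs)


# Hypothetical batch of two sampled solutions.
costs = [3.0, 5.0]              # weighted-sum objective values
baselines = [4.0, 4.0]          # critic estimates for the two instances
grad_log_probs = [[1.0, 0.0], [0.0, 2.0]]
grad = actor_gradient(costs, baselines, grad_log_probs)
loss = critic_loss(costs, baselines)
```

Since the weighted-sum objective is a cost to be minimized, the actor parameters would be moved against this gradient estimate, which lowers the probability of solutions that cost more than the baseline predicts.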
4 Experiments
4.1 Problem Instances and Experimental Settings
MODRL/D-AM is tested on the Euclidean instances in [11]. In the Euclidean instances, the two node features are both sampled from $[0, 1] \times [0, 1]$, and both cost functions between node $i$ and node $j$ are the Euclidean distance between them.
To train the models of MODRL/D-AM, problem instances with 20 and 40 nodes are used. After training, two models of MODRL/D-AM are obtained, so the influence of the number of nodes used in training can be discussed. To show the robustness of our method, the models are tested on problem instances with 20, 40, 70, 100, 150 and 200 nodes. In addition, kroAB100, kroAB150 and kroAB200, generated from TSPLIB [14], are used to test the performance of our method.
DRL-MOA is implemented and used as the baseline. Both our method and DRL-MOA are trained on datasets with 20 and 40 nodes, so there are four models in total: MODRL/D-AM (20), DRL-MOA (20), MODRL/D-AM (40) and DRL-MOA (40). To make the comparison fair, the parameters shared by our method and the baseline are set to the same values. The number of subproblems is set to 100, the input dimension is set to 4, and the dimension of the node embedding is set to 128. In the training process, the batch size is set to 200, the number of training problem instances is set to 500000, the model of the first subproblem is trained for 5 epochs, and the model of each remaining subproblem is trained for 1 epoch. In addition, the critic network consists of four 1-D convolutional layers. The input and output channels of the four convolutional layers are (4, 128), (128, 20), (20, 20) and (20, 1), where the first element of each tuple is the number of input channels and the second element is the number of output channels. For all convolutional layers, the kernel size and stride are set to 1.
In MODRL/D-AM, the number of attention layers is set to 1, the number of heads is set to 8, the dimensions of the query vector and the value vector are both set to 16 (the embedding dimension 128 divided by the 8 heads), and the hidden dimension of the feed-forward sublayer is set to 512.
4.2 Results and Discussions
The hypervolume (HV) indicator is calculated to compare the performance of our method and DRL-MOA on the test instances. When the HV value is computed, the objective values are normalized and a fixed reference point is used. The PFs obtained by MODRL/D-AM and DRL-MOA are also compared. In addition, the influence of the number of nodes used in the training process is discussed. All test experiments are conducted on a GPU (GeForce RTX 2080Ti).
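For a bi-objective minimization problem, the HV of a set of points with respect to a reference point is the area dominated by the set and bounded by the reference point. The sketch below computes it with a simple sweep; the function name `hypervolume_2d`, the sample points and the reference point `(5.0, 5.0)` are hypothetical and are not the settings used in the experiments.

```python
def hypervolume_2d(points, ref):
    """Area dominated by the points and bounded by ref (minimization)."""
    # Only points strictly better than the reference point contribute.
    inside = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in inside:  # sweep in increasing order of the first objective
        if f2 < best_f2:   # dominated points add nothing and are skipped
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


front = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
hv = hypervolume_2d(front, ref=(5.0, 5.0))
hv_with_dominated = hypervolume_2d(front + [(3.0, 3.0)], ref=(5.0, 5.0))
```

A larger HV indicates a front that is closer to the true PF and better spread, which is why the indicator reflects both convergence and diversity.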
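Table 1: HV values on random instances.
| #nodes | MODRL/D-AM (20) | DRL-MOA (20) | MODRL/D-AM (40) | DRL-MOA (40) |
Table 2: HV values on kroAB100, kroAB150 and kroAB200.
| #nodes | MODRL/D-AM (20) | DRL-MOA (20) | MODRL/D-AM (40) | DRL-MOA (40) |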
The HV values on random instances are shown in Table 1. For each of the random instance sizes with 20, 40, 70, 100, 150 and 200 nodes, 10 instances are tested and the average of their HV values is calculated. In terms of the average HV value, MODRL/D-AM (40) performs better than DRL-MOA (40) on all sizes of random instances. For kroAB100, kroAB150 and kroAB200, the HV values are reported in Table 2, and MODRL/D-AM (40) again achieves better performance than DRL-MOA (40). The calculation time of our method is longer than that of DRL-MOA. This is expected, because the graph attention encoder of the attention model requires more computation than a single convolutional layer.
The results on the test instances with different numbers of nodes are shown in Fig. 3. As the number of nodes increases, MODRL/D-AM (40) achieves better performance than DRL-MOA (40) in terms of convergence and diversity. Fig. 4 shows the performance of MODRL/D-AM (40) and DRL-MOA (40) on the kroAB100, kroAB150 and kroAB200 instances. A significant improvement in convergence is observed for our method, and the diversity achieved by our method is also slightly better.
Then, the performances of MODRL/D-AM (40) and MODRL/D-AM (20) are compared to investigate the influence of the number of nodes used in the training process. The HV values in Table 1 show that MODRL/D-AM (40) performs better than MODRL/D-AM (20) on random instances with 40, 70, 100, 150 and 200 nodes. For the random instances with 20 nodes, MODRL/D-AM (40) performs similarly to MODRL/D-AM (20), although it is slightly worse. From the PFs obtained by MODRL/D-AM (40) and MODRL/D-AM (20) in Fig. 5, MODRL/D-AM (40) shows better performance in terms of convergence and diversity. When trained on instances with a larger number of nodes, the model of MODRL/D-AM can learn to deal with more complex information about node features and structure features. Thus, a better model of MODRL/D-AM can be trained with more nodes.
From the experiment results above, it is observed that MODRL/D-AM has a good generalization performance in solving MOTSP. For MODRL/D-AM, the model trained with 40 nodes can be used to approximate the PF of problem instances with 200 nodes. In terms of convergence and diversity, MODRL/D-AM performs better than DRL-MOA.
The good performance of MODRL/D-AM indicates that the graph structure features are helpful in constructing solutions for MOTSP, and that the attention model can extract the structure information of a problem instance effectively. Thus, MODRL/D-AM can also be applied to other similar combinatorial optimization problems with graph structures, such as the multiobjective vehicle routing problem [18, 19]. Finally, one remaining issue is that the solutions of MOTSP instances are not distributed evenly in our experiments, which needs further research.
5 Conclusions
This paper proposes a multiobjective deep reinforcement learning algorithm using decomposition and attention model (MODRL/D-AM). MODRL/D-AM adopts an attention model to model the subproblems of MOPs. The attention model can extract structure features as well as node features of problem instances. Thus, more useful structure information is used to generate better solutions. MODRL/D-AM is tested on MOTSP instances and compared with DRL-MOA, which uses a pointer network to model the subproblems of MOTSP. The results show that MODRL/D-AM achieves better performance. Good generalization to problem instances of different sizes is also observed for MODRL/D-AM.
This work is supported by the National Key R&D Program of China (2018AAA0101203), and the National Natural Science Foundation of China (61673403, U1611262).
References
[1] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[2] (2016) Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.
[3] (2019) The collaborative local search based on dynamic-constrained decomposition with grids for combinatorial multiobjective optimization. IEEE Transactions on Cybernetics, pp. 1–12.
[4] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[5] (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), pp. 182–197.
[6] (2018) Learning heuristics for the TSP by policy gradient. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pp. 170–181.
[7] (2016) Deep residual learning for image recognition. pp. 770–778.
[8] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[9] (2017) Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348–6358.
[10] (2018) Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475.
[11] (2019) Deep reinforcement learning for multi-objective optimization. arXiv preprint arXiv:1906.02386.
[12] (2010) The multiobjective traveling salesman problem: a survey and a new approach. In Advances in Multi-Objective Nature Inspired Computing, pp. 119–141.
[13] (2018) Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pp. 9839–9849.
[14] (1991) TSPLIB–traveling salesman problem library. ORSA Journal on Computing 3 (4), pp. 376–384.
[15] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[16] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
[17] (2015) Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700.
[18] (2019) A two-stage multiobjective evolutionary algorithm for multiobjective multidepot vehicle routing problem with time windows. IEEE Transactions on Cybernetics 49 (7), pp. 2467–2478.
[19] (2019) Multiobjective multiple neighborhood search algorithms for multiobjective fleet size and mix location-routing problem with time windows. IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–15.
[20] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
[21] (2018) Set-based discrete particle swarm optimization based on decomposition for permutation-based multiobjective combinatorial optimization problems. IEEE Transactions on Cybernetics 48 (7), pp. 2139–2153.
[22] (2007) MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation 11 (6), pp. 712–731.