A Deep Reinforcement Learning Approach for Constrained Online Logistics Route Assignment

by Hao Zeng, et al.

As online shopping prevails and e-commerce platforms emerge, a tremendous number of parcels are transported every day. It is therefore crucial for the logistics industry to assign a suitable logistics route to each parcel, since this choice strongly affects both the total logistics cost and the satisfaction of business constraints such as transit hub capacity and the delivery proportion of delivery providers. This online route-assignment problem can be viewed as a constrained online decision-making problem. Notably, the large number of daily parcels (beyond 10^5) and the variability and non-Markovian characteristics of parcel information make it difficult to attain a (near-)optimal solution without violating constraints excessively. In this paper, we develop a model-free DRL approach named PPO-RA, in which Proximal Policy Optimization (PPO) is improved with dedicated techniques to address the challenges of route assignment (RA). The actor and critic networks use an attention mechanism and parameter sharing to accommodate incoming parcels with varying numbers and identities of candidate routes, without modeling the non-Markovian parcel arrival dynamics, since we assume i.i.d. parcel arrival. We evaluate PPO-RA on recorded delivery parcel data by comparing it with widely used baselines in simulation. The results show that the proposed approach achieves considerable cost savings while satisfying most constraints.




In recent years, parcel volumes have grown rapidly with the development of e-commerce platforms and the popularization of online shopping. For example, millions of parcels are delivered within and across countries in South-east Asia, which creates an urgent need to improve the informatization and efficiency of delivery in the logistics industry. Usually, several logistics routes are available for transporting each parcel, and each route is composed of one or more delivery providers. Each provider in a route is responsible for its own transportation section within that route. One of the fundamental problems in the logistics industry is how to arrange a proper route for each sequentially incoming parcel so as to fulfill the requirements or goals of business strategies. This problem is referred to as the online route-assignment problem.

Figure 1: Incoming parcels and their candidate logistics routes in an online route-assignment task.

As shown in Figure 1, a logistics route usually consists of the following elements. i. Section: a route can be split into several logistics sections, e.g., the first-mile or last-mile section. ii. Provider: each section corresponds to a transportation (or delivery) provider that fulfills the delivery task. iii. Transit hub: each transit hub serves one or several providers, which drop off parcels there; in each section, the provider always delivers to exactly one hub. iv. Cost: each route has a transportation cost covering all sections within the route. A route’s identity can be represented by the concatenation of the names of all providers in that route. A parcel to be delivered usually has several candidate routes, and the number of candidate routes varies across parcels. The identities of candidate routes also vary across parcels, since many providers are available and can be combined in different ways. Moreover, candidate routes with the same identity may have different costs for different parcels, mainly because the pairs of origin and destination addresses (i.e., OD pairs) differ among parcels.

The online route-assignment problem introduced above can also be viewed as an online decision-making problem, in which the decision is which route to assign, subject to an objective and constraints derived from business goals (described later). The challenges of the problem lie in the following aspects.

  • The amount and variability of decision-making tasks. The total number of parcels to assign exceeds 10^5 per day per country in South-east Asia, which is a relatively large number for many online decision-making algorithms to attain promising results. Besides, the varying numbers and identities of candidate routes among parcels increase the difficulty of assignment.

  • The online and constrained characteristics of the decision-making problem. Decisions are made based on limited known information, trying to meet the objective without violating the constraints. The total number of parcels and the information of future incoming parcels are difficult to predict and remain unobservable until the end of the period, i.e., until the last parcel has been assigned.

  • The non-Markovian characteristics of incoming parcels. The next incoming parcel, with its candidate routes, is independent of the decision made for the last parcel. However, the status of constraints (e.g., the remaining capacity of hubs) follows Markovian dynamics.

The most basic method for solving this problem is the greedy method, which always assigns the lowest-cost candidate route to each incoming parcel without considering any additional parcel information. However, it violates constraints so severely that it is inappropriate for application. On the other hand, deep reinforcement learning (DRL) approaches, which can extract potentially valuable characteristics from given features, shed light on solving the problem. DRL approaches that use neural networks (NN) as function approximators have achieved massive success in game playing, including AlphaZero Silver et al. (2017b), DQN Mnih et al. (2015), etc. These methods combine deep learning with reinforcement learning to learn directly from sampled data; DQN, for instance, learns Q-values over high-dimensional states and discrete actions. Among policy optimization DRL methods, Trust Region Policy Optimization (TRPO) is effective for optimizing large nonlinear policies Schulman et al. (2015). Proximal Policy Optimization (PPO) Schulman et al. (2017), developed from the actor-critic framework Konda and Tsitsiklis (2000), is simpler to compute, more general to apply, and empirically has better sample complexity than TRPO.

In this study, we develop a DRL approach named PPO-RA to address the challenges, which incorporates PPO with several techniques to cope with constraints and non-Markovian characteristics in online route assignment (RA). The main contributions are as follows:

  • We develop a model-free DRL approach with dedicated design of state and reward for online decision-making under multiple constraints.

  • We incorporate PPO with feature separation to cope with non-Markovian characteristics. We design neural networks with an attention mechanism and shared parameters/structures to handle the variability of parcels.

  • We train the policy using recorded real-world parcel data from Cainiao Network. Validation experiments show that PPO-RA outperforms the statistical methods currently applied in the logistics industry. Moreover, the solutions from PPO-RA are close to the offline optimal solution from mixed-integer programming (MIP), in which all parcels are given to the solver at once in simulation.

Background and Related Work

Online Decision-Making. The online route-assignment task can be regarded as an online decision-making problem. Such problems arise in various fields (e.g., online advertising and resource allocation) and have driven the development of algorithms. For example, the well-known Adwords problem Mehta et al. (2007) requires the decision maker to assign sequentially arriving keywords to bidders in order to maximize profit, subject to the bidders' budget constraints. The primal-dual method is a powerful technique that has proved applicable to a variety of problems requiring approximate solutions. Buchbinder and Naor (2009) extend the primal-dual method to online scenarios. Devanur et al. (2011) present algorithms for solving resource allocation problems and introduce the adversarial stochastic input model, a new distributional model, together with fast approximation algorithms that achieve near-optimal solutions.

Deep Reinforcement Learning. The reinforcement learning (RL) framework provides a mathematical formalism that is widely applicable to sequential decision-making problems. In the RL framework, the agent observes a state $s_t$ from the environment at each time step $t$, and then takes an action $a_t$ according to its policy $\pi(a_t \mid s_t)$. After the action is taken, the state transitions to the next state $s_{t+1}$ and a reward $r_t$ is sent back from the environment to the agent. The goal of RL is to maximize the accumulated discounted reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ by learning an optimal policy, where $\gamma$ is the discount factor. To handle high-dimensional state and action spaces, deep reinforcement learning (DRL) methods employ neural networks for function approximation Mnih et al. (2013). The most successful achievements include AlphaGo Silver et al. (2017b) and AlphaZero Silver et al. (2017a), which convincingly defeated world champion programs in chess, Go, and Shogi, without any domain knowledge other than the underlying rules as input during training.
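As a small illustration of the accumulated discounted reward described above, the following sketch computes the discounted return of a finite reward sequence. The function name `discounted_return` and the example rewards are hypothetical and chosen only for illustration; the infinite-horizon sum is truncated at the end of the episode.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite episode."""
    total = 0.0
    # Accumulate backwards so each step needs only one multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step episode: 1 + 0.5 * 2 + 0.25 * 3
example = discounted_return([1.0, 2.0, 3.0], gamma=0.5)
```

A smaller discount factor makes the agent weigh immediate rewards more heavily than distant ones.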

Model-free reinforcement learning methods mainly include value learning methods and policy gradient methods. Value learning methods aim to learn value functions explicitly, from which the optimal policy can be obtained. A commonly used branch of value learning includes Deep Q-Network (DQN) Mnih et al. (2013) and its variants (e.g., Rainbow Hessel et al. (2018)). DQN variants are mostly suitable for discrete action spaces and have succeeded in mastering a range of Atari 2600 games. Policy gradient methods, meanwhile, attempt to learn optimal policies directly. Policy gradient methods assisted by baselines (e.g., value functions) are also referred to as Actor-Critic methods, which are suitable for both discrete and continuous action spaces. Representative Actor-Critic methods include DDPG Lillicrap et al. (2015), TRPO Schulman et al. (2015), and PPO Schulman et al. (2017). TRPO develops a series of approximations that convert the original policy gradient problem into minimizing a surrogate loss function under a constraint on the KL divergence between the old and new policies, which guarantees policy improvement with non-trivial step sizes. PPO Schulman et al. (2017) also limits how far each update moves the policy, but uses a simpler approximation than TRPO without losing reliability, which makes it more applicable to large-scale decision problems.

Reinforcement Learning for Online Decision-Making. An increasing number of studies employ DRL methods for industrial decision-making problems. Zhang and Dietterich (1995) utilized temporal difference learning to learn a heuristic evaluation function over states, yielding domain-specific heuristics for job-shop scheduling. Tesauro and others (2005); Tesauro et al. (2006) showed the feasibility of using online RL to learn resource valuation estimates, which can be used to make high-quality server allocation decisions in a multi-application prototype data center scenario.

More recently, Ye et al. (2019) develop a novel decentralized resource allocation mechanism for vehicle-to-vehicle (V2V) communications based on DRL. To minimize power consumption while meeting the demands of wireless users over a long operational period, Xu et al. (2017) present a novel DRL-based framework for power-efficient resource allocation in cloud RANs. Du et al. (2019) learn a policy that maximizes the net profit of the cloud provider through trial and error; it integrates long short-term memory (LSTM) neural networks into an improved DDPG to deal with online user arrivals and addresses both resource allocation and pricing. Most of these studies assume that the environment is Markovian. In contrast, because parcel arrival is non-Markovian, we assume that parcel arrivals are independent and identically distributed (i.i.d.), and produce an optimal policy by learning the state transition of the constraint state.

Problem Modeling

Here, we formulate the constrained online route-assignment task as a reinforcement learning problem. Let $p_t$ be the information of the incoming parcel that arrives at time $t$. As shown in Figure 2, the environment can be viewed as the integration of the parcel arrival system and the constraint state. The agent is responsible for assigning a proper candidate route to each incoming parcel. In each time step, a parcel arrives and the agent takes an action $a_t$ (i.e., chooses one route from all candidate routes of $p_t$). After the action is taken, the state $s_t$ transitions to state $s_{t+1}$ with probability 1. At the same time, the agent receives an immediate reward $r_t$. Finally, the agent waits for the next parcel to arrive and repeats the above process. Our goal is to obtain an agent with an optimal policy for allocating candidate routes to daily shipping parcels.

Parcel information and business goals. Parcel information consists of parcel attributes and candidate-route attributes. Parcel attributes are a parcel’s own attributes including i. the origin address, ii. the destination address, iii. parcel weight, iv. creation time, v. parcel ID. Candidate-route attributes are attributes for all candidate routes of a parcel. Consider a typical type of candidate route containing only two sections (i.e., first-mile and last-mile sections), then each route has its own candidate-route attributes including i. the first-mile provider, ii. the last-mile provider, iii. the last-mile hub, iv. cost.

The most commonly adopted business goals of route assignment over a fixed period are: i. minimize the total cost of the assigned routes; ii. ensure that the maximal capacity of each hub in a certain section is not exceeded, or is exceeded only to a limited degree, where the capacity is the maximal number of parcels that can be stored; iii. ensure that the delivery proportion of each required provider in a certain section stays within a given range under given OD pairs, where the proportion is the ratio of the number of parcels transported by a certain provider to the number transported by all required providers under the given OD pairs. In this study, we use a “day” as the fixed period of transportation and the last-mile section as the required section.

These goals can be converted into an objective and constraints. Goal i is the objective of minimizing the total cost. Goals ii and iii can be expressed respectively as

  • Hub capacity constraint: the maximal (i.e., upper bound) capacity for each hub in last-mile section. We name it Type-A constraint.

  • Route proportion constraint: the range of delivery proportion of each required provider in last-mile section of a certain route. We name it Type-B constraint.

Let $\mathcal{P}$ be the set of parcels, $\mathcal{R}_i$ be the set of all candidate routes for parcel $i$, $\mathcal{K}$ be the set of all constraints, $\mathcal{R}^k$ be the routes corresponding to constraint $k$, $L_k$ be the lower bound of constraint $k$, $U_k$ be the upper bound of constraint $k$, $\underline{\rho}_j$ be the lower bound proportion of the Type-B constraint for route $j$, and $\overline{\rho}_j$ be the upper bound proportion of the Type-B constraint for route $j$. If all incoming parcels were known a day ahead, the optimal solution could be obtained by solving the following mixed-integer programming (MIP) problem.
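A formulation consistent with the definitions above can be sketched as follows. The symbols are assumptions introduced here for illustration: $\mathcal{P}$ is the set of parcels, $\mathcal{R}_i$ the candidate routes of parcel $i$, $\mathcal{K}$ the set of constraints, $\mathcal{R}^k$ the routes tied to constraint $k$, $L_k$ and $U_k$ its lower and upper bounds, $\mathrm{cost}_{ij}$ the cost of route $j$ for parcel $i$, and $x_{ij}$ a binary variable equal to 1 when route $j$ is assigned to parcel $i$. It should be read as a generic assignment program with two-sided counting constraints, not necessarily the exact formulation (1) used by the authors.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{R}_i} \mathrm{cost}_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{j \in \mathcal{R}_i} x_{ij} = 1, && \forall i \in \mathcal{P}, \\
& L_k \le \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{R}_i \cap \mathcal{R}^k} x_{ij} \le U_k, && \forall k \in \mathcal{K}, \\
& x_{ij} \in \{0, 1\}, && \forall i \in \mathcal{P},\ j \in \mathcal{R}_i.
\end{aligned}
```

The first constraint assigns exactly one route to every parcel, and the second bounds the number of parcels routed through the routes covered by each constraint.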

Figure 2: Route-assignment task as a reinforcement learning problem.

For Type-A constraint, and is the upper bound capacity of hub . For Type-B constraint, where is the number of candidate routes corresponding to Type-B constraint .

However, future incoming parcels have non-Markovian characteristics and are almost unpredictable. We assume that the next incoming parcel is independent of the current constraint state and of previous parcels. Therefore, the next state of the environment is not completely determined by the previous state and action, which violates the MDP assumption of RL algorithms. To address this violation, we reconstruct the MDP dynamics for this constrained online route-assignment problem.

Let $\mathcal{H}$ be the set of all hubs with Type-A constraints, $\mathcal{B}$ be the set of all routes with Type-B constraints, $m_h$ be the maximal capacity of hub $h$, $u_t^h$ be the current remaining capacity of hub $h$ at time $t$, and $\rho_t^j$ be the current proportion of the Type-B constraint for route $j$ at time $t$. Then, the components of our reinforcement learning problem are defined as follows:

  • State: Constraint satisfaction dynamics obey a Markovian state transition, which means that the next constraint state $c_{t+1}$ depends only on the current constraint state $c_t$ and the action $a_t$. Hence, the state of the MDP is defined by the current constraint state $c_t$, which is composed of

    • the hub capacity state $\{u_t^h\}_{h \in \mathcal{H}}$.

    • the route proportion state $\{\rho_t^j\}_{j \in \mathcal{B}}$.

  • Action: The action is to choose (assign) one of the candidate routes for each parcel. The action depends on both the constraint state $c_t$ and the parcel information $p_t$, i.e., $a_t \sim \pi(\cdot \mid c_t, p_t)$.

  • Reward: The design of the reward is the most challenging part of the problem. At each time step, the immediate reward should integrate the constraint state and the parcel information. Since the objective is to minimize the total cost, the first part of the reward is the cost of the assigned route, which depends on the action. For Type-A constraints, the smaller the remaining capacity, the greater the penalty. For Type-B constraints, we give encouragement that pushes the proportion toward the lower bound if the current proportion is below it, and we give punishment if the current proportion is above the upper bound. Hence, the reward combines the cost of the assigned route with a constraint state function, weighted by a hyperparameter that balances the importance of the two. The constraint state function takes a different form depending on whether the constraint is a hub capacity constraint or a route proportion constraint.
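To make the interplay between the deterministic constraint-state transition and a cost-plus-penalty reward concrete, here is a minimal sketch for the hub capacity (Type-A) case only. Everything in it is an assumption for illustration: the function names, the linear penalty shape, and the weight `lam` are hypothetical and are not the reward function used by the authors.

```python
def step_capacity(remaining, hub):
    """Deterministic constraint-state transition: assigning a parcel to a
    route through `hub` consumes one unit of that hub's remaining capacity."""
    updated = dict(remaining)
    updated[hub] = updated[hub] - 1
    return updated


def capacity_penalty(remaining, max_capacity):
    """Hypothetical Type-A penalty in [0, 1]: grows as the remaining capacity
    shrinks and saturates at 1 once the hub is full or over capacity."""
    used_fraction = 1.0 - remaining / max_capacity
    return min(1.0, max(0.0, used_fraction))


def reward(cost, remaining, max_capacity, lam):
    """Hypothetical reward: negative cost minus a weighted constraint penalty."""
    return -cost - lam * capacity_penalty(remaining, max_capacity)


state = {"hub_a": 2, "hub_b": 10}
state = step_capacity(state, "hub_a")  # hub_a now has 1 slot left
r = reward(cost=5.0, remaining=state["hub_a"], max_capacity=10, lam=2.0)
```

A larger `lam` makes the agent trade cost for constraint satisfaction more aggressively; a Type-B term would be added analogously, rewarding proportions below the lower bound and penalizing those above the upper bound.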


DRL Algorithm

Proximal policy optimization (PPO) (Schulman et al., 2017) is a commonly used RL algorithm with good performance in applications. In this section, we describe how to apply and adapt PPO to the route-assignment task. As PPO is an Actor-Critic algorithm, the policy function and the state value function (often represented by an actor network and a critic network) need to be estimated jointly or separately during training. One form of the advantage function, based on the temporal-difference (TD) error used in value function estimation, is:

$$A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Due to the non-Markovian parcel arrival dynamics in a route-assignment task, we assume that the incoming parcels are unpredictable. The state value $V(s_{t+1})$, hence, cannot be estimated. On the other hand, the state value depends not only on the parcel information but also on the constraint state. Therefore, the advantage can be defined as the following


where $V$ is the state value function, which is represented by the critic network. The loss function for the critic network update is


Feature Separation. We propose feature separation to properly handle the non-Markovian and Markovian attributes (i.e., the parcel information $p_t$ and the constraint state $c_t$, respectively). A constraint table containing the latest global constraint satisfaction information is maintained throughout training and testing, and is updated each time a parcel is assigned. Furthermore, several in-time preprocessing procedures, namely feature separation, convert the information of each incoming parcel (i.e., parcel attributes and candidate-route attributes) into valid input for the actor and critic networks. The input after preprocessing contains two parts: parcel features and candidate-route features.

  • Parcel features. Parcel features can be viewed as a part of the parcel information $p_t$. All or some of the parcel attributes can be processed into parcel features. Here, we discretize the parcel weight, origin address, and destination address, and concatenate the results into a vector as the parcel features.

  • Candidate-route features. Candidate-route features can be viewed as a combination of a part of the parcel information $p_t$ and the constraint state $c_t$. Let $N_{\max}$ be the maximum possible number of candidate routes for each parcel, and $n$ be the number of candidate routes of the parcel to assign. First, we reorder the candidate routes in ascending order of cost, so that the route with minimal cost comes first. Then, we query the constraint table using several key elements (i.e., origin address, destination address, hubs, and last-mile providers) to get the constraint state relevant to this parcel. If $n < N_{\max}$, the candidate-route features will contain $N_{\max} - n$ dummy routes with default costs and default constraint states, whose values are all set to 0. Finally, the candidate-route features are formed by concatenating the route vectors as $[v_1, v_2, \ldots, v_{N_{\max}}]$, where $v_i$, $i = 1, \ldots, N_{\max}$, is the route vector of each reordered real or dummy candidate route, including the corresponding cost.
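The preprocessing of candidate-route features described above (sort by cost, look up the constraint state, pad with all-zero dummy routes) can be sketched as follows. The function name `build_route_features`, the two-element route vector, and the dictionary-based constraint table keyed by route identity are simplifying assumptions for illustration; the actual lookup keys and feature layout in the paper are richer.

```python
def build_route_features(routes, constraint_table, max_routes):
    """Sort routes by cost, attach constraint state, and pad with dummies.

    routes: list of (route_id, cost) tuples for one parcel.
    constraint_table: dict mapping route_id -> constraint-state value.
    Returns (features, mask); mask[i] is True for real routes.
    """
    if len(routes) > max_routes:
        raise ValueError("parcel has more candidate routes than max_routes")
    ordered = sorted(routes, key=lambda rc: rc[1])  # cheapest route first
    features = [[cost, constraint_table.get(route_id, 0.0)]
                for route_id, cost in ordered]
    mask = [True] * len(features)
    # Dummy routes carry all-zero vectors and are flagged False in the mask.
    while len(features) < max_routes:
        features.append([0.0, 0.0])
        mask.append(False)
    return features, mask


table = {"P1-P3": 0.8, "P2-P3": 0.1}  # hypothetical constraint table
feats, mask = build_route_features([("P1-P3", 7.5), ("P2-P3", 6.0)], table, 4)
```

Returning the mask alongside the padded features lets downstream layers exclude dummy routes without inspecting the feature values.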

Actor network. As shown in Figure 3, parcel features and candidate-route features are fed into separate embedding layers, each followed by a Multi-Layer Perceptron (MLP). We use several neural network branches to receive the route vectors in the candidate-route features, with all branches sharing identical parameters. Since route vectors of dummy routes may exist, we add a mask that converts the output from Dot-product Concat for dummy routes to negative infinity. Therefore, the probability that the Softmax layer outputs for these dummy routes is always 0.

Figure 3: The actor network. Parameter sharing is applied to route vectors in candidate-route features and probabilities of assigning each route are output from Softmax layer.
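The masking step in the actor network can be illustrated with a small, framework-free sketch: scores of dummy routes are replaced by negative infinity before the softmax, so their probability is exactly 0. The function name `masked_softmax` and the example scores are hypothetical.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`; entries whose mask is False get probability 0."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("at least one route must be real")
    # exp(-inf) evaluates to 0.0, so dummy routes drop out of the sum.
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Two real routes followed by two dummy routes.
probs = masked_softmax([2.0, 1.0, 0.0, 0.0], [True, True, False, False])
```

Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the resulting probabilities.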

Critic network. The critic network in Figure 4 is similar to the actor network. Parameter sharing is also used to accommodate candidate-route features, and an attention mechanism Vaswani et al. (2017) is employed to calculate the state value for each incoming parcel. An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. We treat parcel features as the query, and candidate-route features are used to generate the keys and values, which is


where . Then, the weighted sum is computed by


Finally, state value can be obtained by

Figure 4: The critic network. Parameter sharing is applied to route vectors in candidate-route features. Masked-Attention layers are used for calculating state value function given a certain parcel and current constraint state.
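The attention step used by the critic can be sketched as masked attention with a single query. This sketch assumes scaled dot-product compatibility in the style of Vaswani et al. (2017); the exact projections and compatibility function of the critic are not reproduced here, and the function name and toy vectors are hypothetical.

```python
import math


def masked_attention(query, keys, values, mask):
    """Scaled dot-product attention for one query with masked-out keys.

    Assumes at least one key is unmasked.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale if keep
              else -math.inf
              for key, keep in zip(keys, mask)]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # masked keys get weight 0
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    # Weighted sum of the value vectors.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


out = masked_attention(
    query=[1.0, 0.0],                       # stands in for parcel features
    keys=[[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]],
    values=[[2.0, 0.0], [4.0, 2.0], [100.0, 100.0]],
    mask=[True, True, False],               # third route is a dummy
)
```

Because the dummy route is masked, its large value vector has no influence on the output, which is the average of the two real value vectors in this toy case.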

Therefore, the improved clipped optimization objective for policy updating is


where .

Input: initial policy parameters $\theta_0$ and initial value function parameters $\phi_0$.

1:  for $k = 0, 1, 2, \ldots$ do
2:     Collect a set of trajectories $\mathcal{D}_k$ by running policy $\pi_{\theta_k}$ in the environment.
3:     Compute rewards for each trajectory.
4:     Compute advantage estimates, $\hat{A}_t$.
5:     Update the policy parameters to $\theta_{k+1}$ by maximizing the PPO objective, typically via stochastic gradient ascent with Adam.
6:     Fit the value function parameters $\phi_{k+1}$ by regression on mean-squared error, typically via stochastic gradient descent with Adam.
7:  end for
Algorithm 1 Proximal policy optimization for route assignment task (PPO-RA)

Our complete DRL algorithm is given in Alg. 1. The trajectories are collected in parallel by running the current policy (line 2). Then the parameters of the actor and critic networks are updated using the trajectories.
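For readers unfamiliar with the policy update in Alg. 1, the sketch below evaluates the standard PPO clipped surrogate of Schulman et al. (2017) on a batch. It is the generic objective, not the modified PPO-RA objective that conditions on the parcel and constraint state; the function name is hypothetical and the clipping range `clip_eps=0.2` is an assumed value, not one reported in this paper.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate over a batch (a quantity to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


value = clipped_surrogate(
    logp_new=[math.log(0.9), math.log(0.2)],
    logp_old=[math.log(0.5), math.log(0.4)],
    advantages=[1.0, -1.0],
)
```

In the first sample the ratio 1.8 is clipped to 1.2, and in the second the ratio 0.5 is clipped to 0.8, so large policy changes stop contributing additional gain to the objective.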

Performance Evaluation

Algorithm Implementation.

We implement and evaluate PPO-RA on a workstation (Ubuntu 16.04) equipped with an Intel Xeon Platinum 8163 @ 2.50 GHz, 32 GB of memory, and an Nvidia Tesla V100 GPU with 16 GB of memory. We use PyTorch Paszke et al. (2019) for implementation. For the neural network settings, the embedding dimension is 64 in both the actor and critic networks. For the actor network, the MLP part for parcel features has a single layer of 128 neurons with ReLU activation. The MLP part for candidate-route features has two layers: a layer of 256 neurons with ReLU activation followed by a linear layer of 128 neurons. For the critic network, the MLP parts have the same settings as in the actor network. Following the masked-attention part, the last MLP part has a layer of 64 neurons with Sigmoid activation followed by a linear layer with 1 neuron. Learning rates for actor and critic network are respectively

and . We set and for Type-A and Type-B constraint respectively.

Dataset. We utilize datasets of multiple countries recorded in Cainiao Network. A dataset contains two parts of data, namely parcel data and constraint configuration data.

  • Parcel data. This dataset contains historical parcels delivered within a certain country over multiple days, sorted by creation time. Each row of the dataset is the parcel information introduced in section Problem Modeling.

  • Constraint configuration data. This dataset contains constraint configurations for a certain country, such as hub capacity constraints and route proportion constraints.

The datasets of country #1 and country #2 are used for training and testing. We use codes to denote the names of countries for confidentiality reasons. Dataset #1 contains only hub capacity constraints (625 in total), and the number of daily created parcels to assign ranges from 567429 to 806824. Dataset #2 contains only route proportion constraints (51 in total), and the number of daily created parcels ranges from 293208 to 326332.

In the training procedure, we select the parcels of a certain country (#1 or #2) created within day $T$ from the parcel data and feed them to the simulation for trajectory collection. The simulation sends the parcels to the agent in a certain order (e.g., sorted by creation time) for route assignment. We train the agent for about 20 iterations until convergence, which takes about 10 hours. In each iteration, we first collect 50 trajectories in parallel and put all MDP tuples of the trajectories into a list, then shuffle the list and use it to update the parameters of the actor and critic networks with Adam. The mini-batch size for gradient descent is 2048.
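The data handling within one training iteration (pool all MDP tuples, shuffle, split into mini-batches) can be sketched as below. The helper name `minibatches` and the toy tuples are hypothetical; in the setting described above there would be 50 trajectories and a batch size of 2048.

```python
import random


def minibatches(trajectories, batch_size, rng):
    """Flatten trajectories into one list of tuples, shuffle, and yield batches."""
    pool = [step for trajectory in trajectories for step in trajectory]
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


# Three toy trajectories of five placeholder tuples each.
trajs = [[("s", i, t) for t in range(5)] for i in range(3)]
batches = list(minibatches(trajs, batch_size=4, rng=random.Random(0)))
```

Shuffling across trajectories breaks the temporal correlation between consecutive tuples before the gradient updates; the last batch may be smaller than `batch_size`.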

In the testing procedure, we select the parcels of a certain country (#1 or #2) created within the next three days (i.e., $T+1$, $T+2$, $T+3$) for testing.

Baselines. We compare PPO-RA with 3 baseline methods applied in the logistics industry:

  1. MIP: The online route-assignment problem can be reformulated as the mixed-integer programming (MIP) problem (1) if all parcels to assign are known in advance. Its solution is the offline optimal solution of the route-assignment problem. Therefore, we adopt the MIP gap to measure the performance difference between MIP and a given algorithm (PPO-RA or the others). We use SCIP Achterberg (2009), a commonly used solver, to obtain MIP solutions.

  2. Proportion: An effective algorithm that is commonly used for the online route-assignment task. The Proportion algorithm relies on MIP solutions. First, a MIP is solved on historical delivered parcel data over a period of time (e.g., one month) to obtain the assignment probability of each last-mile provider at the OD-pair level and the discretized weight category level. This probability is called the offline assignment probability. When applied to online assignment, the Proportion algorithm looks up the offline assignment probability of the last-mile provider in each candidate route of an incoming parcel. Then, roulette wheel selection is used to assign a route to that parcel, so that the candidate route with the highest offline assignment probability is the most likely to be assigned. However, the Proportion algorithm’s solutions almost always deviate from the optimal solutions, since the parcel information varies from day to day.

  3. Greedy: Greedy algorithms do not produce optimal solutions for many problems, but can yield locally optimal solutions that approximate the global optimum within a reasonable amount of solving time. If there are no constraints, the greedy algorithm achieves the optimal solution, since the objective is to minimize the total cost. In the online route-assignment problem, a greedy algorithm that assigns the candidate route with the smallest cost to each parcel can be used as an online method.
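The two online baselines can be sketched in a few lines. This is a simplified illustration, not the production algorithms: the function names are hypothetical, and the offline assignment probabilities are keyed directly by route identity here, whereas the Proportion algorithm described above looks them up per last-mile provider, OD pair, and weight category.

```python
import random


def greedy_assign(routes):
    """Return the id of the cheapest route; routes is a list of (id, cost)."""
    return min(routes, key=lambda rc: rc[1])[0]


def roulette_assign(routes, offline_prob, rng):
    """Sample a route with probability proportional to its offline weight.

    offline_prob maps route id -> non-negative weight; at least one
    candidate must have a positive weight.
    """
    ids = [route_id for route_id, _ in routes]
    weights = [offline_prob.get(route_id, 0.0) for route_id in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


candidates = [("P1-P3", 7.5), ("P2-P3", 6.0), ("P2-P4", 6.8)]
cheapest = greedy_assign(candidates)
rng = random.Random(0)
picked = roulette_assign(candidates, {"P1-P3": 0.7, "P2-P3": 0.3}, rng)
```

The greedy choice ignores constraints entirely, while the roulette choice follows historical proportions and can therefore drift from the optimum when the parcel mix changes between days.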

Accordingly, performance metrics are:

  • average cost: the total cost of assigned parcels divided by the number of assigned parcels. For confidentiality reasons, we linearly transform the average cost to another number.

  • MIP gap: the difference between average cost of MIP solution and average cost of the compared algorithm’s solution, divided by average cost of MIP solution.

  • constraint violation rate: the number of parcels with constraint violation after route assignment divided by the total number of parcels. Notably, the MIP solution has a zero constraint violation rate for hub capacity constraints and a rate of less than 1% for route proportion constraints, since the MIP solution is the optimal solution solved in an offline manner.
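The three metrics translate directly into code. The function names and the example numbers below are hypothetical and serve only to show the sign convention of the MIP gap: it is negative when the compared algorithm is more expensive than the MIP solution.

```python
def average_cost(costs):
    """Total cost of assigned parcels divided by the number of parcels."""
    return sum(costs) / len(costs)


def mip_gap(mip_avg_cost, algo_avg_cost):
    """(MIP average cost - algorithm average cost) / MIP average cost."""
    return (mip_avg_cost - algo_avg_cost) / mip_avg_cost


def violation_rate(violated_flags):
    """Share of parcels whose assignment violates some constraint."""
    return sum(1 for v in violated_flags if v) / len(violated_flags)


gap = mip_gap(100.0, 100.5)  # negative: the algorithm costs more than MIP
rate = violation_rate([False, True, False, False])
```

A positive gap is possible for an online method, but only when it undercuts the MIP cost by violating constraints that the MIP solution respects.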

Evaluation results. Tables 1 and 2 show the average cost per parcel and the MIP gap achieved by PPO-RA and the Proportion algorithm on datasets #1 and #2. Compared with the Proportion algorithm, PPO-RA reduces the average cost by about 0.3%-0.4% on dataset #1 and about 0.2%-0.3% on dataset #2. For the constraint violation rates shown in Figures 5 and 6, PPO-RA has almost the same violation rates for hub capacity constraints as the Proportion algorithm, and its violation rates for route proportion constraints are lower than those of the Proportion and Greedy algorithms. In summary, PPO-RA obtains a lower average cost per parcel (i.e., closer to the MIP solutions) without higher constraint violation rates than the compared baselines.

Day   Algorithm    Average Cost   MIP Gap
T+1   PPO-RA       100.73         -0.0706%
T+1   Proportion   100.05         -0.3843%
T+1   MIP          100.66         0%
T+2   PPO-RA       99.789         -0.0671%
T+2   Proportion   100.19         -0.4707%
T+2   MIP          99.719         0%
T+3   PPO-RA       98.482         -0.0862%
T+3   Proportion   98.927         -0.5403%
T+3   MIP          98.396         0%
Table 1: The evaluation results for cost in country #1. PPO-RA reduces the average cost by about 0.3%-0.4% compared with the Proportion algorithm.

Day   Algorithm    Average Cost   MIP Gap
T+1   PPO-RA       81.193         0.1277%
T+1   Proportion   81.459         -0.1968%
T+1   MIP          81.297         0%
T+2   PPO-RA       78.566         -0.0495%
T+2   Proportion   78.723         -0.2512%
T+2   MIP          78.523         0%
T+3   PPO-RA       84.755         -0.1213%
T+3   Proportion   84.930         -0.3307%
T+3   MIP          84.693         0%
Table 2: The evaluation results for cost in country #2. PPO-RA reduces the average cost by about 0.2%-0.3% compared with the Proportion algorithm.

Figure 5: The evaluation results for constraint violation in country #1. Constraint violation rates of PPO-RA are almost the same as those of the Proportion algorithm, and are 0.3%-0.7% lower than those of the Greedy algorithm.
Figure 6: The evaluation results for constraint violation in country #2. Constraint violation rates of PPO-RA are 0.8%-1.5% lower than those of the Proportion algorithm and 3%-5% lower than those of the Greedy algorithm.


This study focuses on a typical online decision-making problem, namely the constrained online logistics route assignment problem, which aims to assign a proper logistics route to each incoming parcel so as to optimize a given objective (e.g., minimizing total logistics cost) while satisfying most business constraints, such as hub capacity and the delivery proportion of delivery providers. The problem poses several challenges, including the large number (beyond 10^5) of daily parcels to assign, the variability in the number and attributes of candidate routes for each parcel, and the non-Markovian characteristics of parcel arrival dynamics. To address these challenges, we propose a model-free DRL approach named PPO-RA, in which Proximal Policy Optimization (PPO) is improved with dedicated techniques designed for constrained online route assignment (RA) tasks. The actor and critic networks adopt an attention mechanism and parameter sharing to accommodate each incoming parcel with varying numbers and identities of candidate routes, without modeling non-Markovian parcel arrival dynamics, since we assume i.i.d. parcel arrivals. Using datasets of delivery parcels recorded by Cainiao Network in multiple countries, we validate the proposed approach against baselines commonly used in the logistics industry. The results are promising: in the majority of cases, PPO-RA achieves a considerably larger reduction in total cost while violating fewer constraints.

