Introduction
In recent years, the number of parcels has been increasing rapidly due to the development of e-commerce platforms and the popularization of online shopping. For example, millions of parcels are delivered within and across countries in Southeast Asia, which creates an urgent need to improve the informatization and delivery efficiency of the logistics industry. Usually, several logistics routes are available for transporting each parcel, and each route is composed of one or more delivery providers. Every provider in a route is responsible for its own transportation section within that route. One of the fundamental problems in the logistics industry is how to arrange a proper route for each sequentially incoming parcel so as to fulfill the requirements and goals of business strategies. This problem is referred to as the online route-assignment problem.
As shown in Figure 1, a logistics route usually consists of the following elements: i. section: a route can be split into several logistics sections, e.g., a first-mile or last-mile section; ii. provider: each section corresponds to a transportation (or delivery) provider that fulfills the delivery task; iii. transit hub: each transit hub has one or several providers that drop off parcels, and in each section the provider always delivers to one hub; iv. cost: each route has a transportation cost summed over all sections within the route. A route's identity can be represented by the concatenation of the names of all providers in that route. A parcel to be delivered usually has several candidate routes, and the number of routes varies. Moreover, the identities of candidate routes differ among parcels, since many providers are available and the possible combinations vary. The costs of routes with identical identities may also differ among parcels, mainly because the pairs of origin and destination addresses (i.e., OD pairs) differ among parcels.
The online route-assignment problem introduced above can also be viewed as an online decision-making problem, in which the decision is to choose the route to assign while considering an objective and constraints derived from business goals, as described later. The challenges of the problem lie in the following aspects.

The amount and variability of decision-making tasks. The total number of parcels to assign per day in each Southeast Asian country is large for many online decision-making algorithms to handle while attaining promising results. Moreover, the varying numbers and identities of candidate routes among parcels increase the difficulty of assignment.

The online and constrained characteristics of the decision-making problem. Decisions are made with limited information, trying to meet the objective while avoiding constraint violations. The total number of parcels and the information of future incoming parcels are difficult to predict and remain unobservable until the end of the period, i.e., until the last parcel has been assigned.

The non-Markovian characteristics of incoming parcels. The next incoming parcel, with its candidate routes, is independent of the decision made for the previous parcel. However, the status of the constraints (e.g., the remaining capacity of hubs) follows Markovian dynamics.
The most basic method for this problem is the greedy method, which always assigns the lowest-cost candidate route to each incoming parcel without considering additional parcel information. However, its serious constraint violations make it inappropriate in practice. On the other hand, several deep reinforcement learning (DRL) approaches, capable of extracting potentially valuable characteristics from given features, shed light on solving the problem. DRL approaches with neural networks (NN) as function approximators have achieved massive success in game playing, including AlphaGo Zero Silver et al. (2017b) and DQN Mnih et al. (2015). These methods combine deep learning with reinforcement learning to directly learn Q-values of high-dimensional states and discrete actions from sampled data. Among policy-optimization DRL methods, Trust Region Policy Optimization (TRPO) Schulman et al. (2015) is effective for optimizing large nonlinear policies. Proximal Policy Optimization (PPO) Schulman et al. (2017), developed from the actor-critic framework Konda and Tsitsiklis (2000), is simpler to compute, more general to apply, and empirically has better sample complexity than TRPO. In this study, we develop a DRL approach named PPO-RA to address these challenges; it incorporates PPO with several techniques to cope with constraints and non-Markovian characteristics in online route assignment (RA). The main contributions are as follows:

We develop a model-free DRL approach with a dedicated design of state and reward for online decision-making under multiple constraints.

We incorporate PPO with feature separation to cope with non-Markovian characteristics, and design neural networks with an attention mechanism and shared parameters/structures to handle the variability of parcels.

We train the policy using recorded real-world parcel data from Cainiao Network. Validation experiments show that PPO-RA outperforms statistical methods currently applied in the logistics industry. Moreover, the solutions of PPO-RA are close to the offline optimal solutions of Mixed-Integer Programming (MIP), in which all parcels are input at once and solved in simulation.
Background and Related Work
Online Decision-Making. The online route-assignment task can be viewed as an online decision-making problem. Such problems emerge in various fields (e.g., online advertising and resource allocation) and have driven algorithmic development. For example, the well-known Adwords problem Mehta et al. (2007) requires the decision maker to assign sequentially arriving keywords to bidders in order to maximize profit, subject to budget constraints on the bidders. The primal-dual method is a powerful algorithm that has proven applicable to a variety of problems requiring approximate solutions. Buchbinder and Naor (2009) extend the primal-dual method to online scenarios. Recently, Devanur et al. (2011) presented algorithms for resource allocation problems and introduced the adversarial stochastic input model, a new distributional model under which near-optimal solutions can be approximated quickly.
Deep Reinforcement Learning. The reinforcement learning (RL) framework provides a mathematical formalism widely applicable to sequential decision-making problems. In the RL framework, the agent observes a state $s_t$ from the environment at each time step $t$, and then takes an action $a_t$ according to its policy $\pi$. After the action is taken, the state transits to the next state $s_{t+1}$ and a reward $r_t$ is sent back from the environment to the agent. The goal of RL is to maximize the accumulated discounted reward $\sum_t \gamma^t r_t$ by learning an optimal policy, where $\gamma \in (0, 1]$ is the discount factor. To handle high-dimensional state and action spaces, deep reinforcement learning (DRL) methods employ neural networks for function approximation Mnih et al. (2013). The most celebrated achievements include AlphaGo Zero Silver et al. (2017b) and AlphaZero Silver et al. (2017a), which convincingly defeated world-champion programs in chess, Go, and shogi, with no domain knowledge other than the underlying rules as input during training.
Model-free reinforcement learning methods mainly include value learning methods and policy gradient methods. Value learning methods aim at explicitly learning value functions from which the optimal policy can be obtained. A commonly used branch of value learning includes Deep Q-Network (DQN) Mnih et al. (2013) and its variants (e.g., Rainbow Hessel et al. (2018)). DQN variants are mostly suited to discrete action spaces and have succeeded in mastering a range of Atari 2600 games. Policy gradient methods, meanwhile, attempt to learn optimal policies directly. Policy gradient methods assisted by baselines (e.g., value functions) are also referred to as Actor-Critic methods, which are suitable for both discrete and continuous action spaces. Representative Actor-Critic methods include Deep Deterministic Policy Gradient (DDPG) Lillicrap et al. (2015), TRPO Schulman et al. (2015), and PPO Schulman et al. (2017),
among others. TRPO develops a series of approximations that convert the original policy gradient problem into minimizing a surrogate loss function under a KL-divergence constraint between the old and new policies, which guarantees policy improvement with non-trivial step sizes. Also a trust-region method, PPO Schulman et al. (2017) simplifies the approximation without losing reliability compared with TRPO, making it more applicable to large-scale decision problems. Reinforcement Learning for Online Decision-Making. An increasing number of studies employ DRL methods for industrial decision-making problems. Zhang and Dietterich (1995) utilized temporal difference learning
to learn a heuristic evaluation function over states, capturing domain-specific heuristics for job-shop scheduling. Tesauro et al.
Tesauro and others (2005); Tesauro et al. (2006) showed the feasibility of online RL for learning resource-valuation estimates that can be used to make high-quality server allocation decisions in a multi-application prototype data center scenario.
Recently, Ye, Li, and Juang Ye et al. (2019) developed a novel decentralized resource allocation mechanism for vehicle-to-vehicle (V2V) communications based on DRL. To minimize power consumption while meeting the demands of wireless users over a long operational period, Xu et al. (2017) presented a novel DRL-based framework for power-efficient resource allocation in cloud RANs. Du et al. (2019) learn a policy that maximizes the net profit of a cloud provider through trial and error, integrating long short-term memory (LSTM) neural networks into an improved DDPG to handle online user arrivals and to address both resource allocation and pricing. Most of these studies assume that the environment is Markovian. In contrast, we assume that parcel arrivals are independently and identically distributed (i.i.d.) owing to the non-Markovian characteristics of parcel arrival, and learn an optimal policy from the state transitions of the constraint state.
Problem Modeling
Here, we formulate the constrained online route-assignment task as a reinforcement learning problem. Let $x_t$ be the information of the parcel arriving at time $t$. As shown in Figure 2, the environment can be viewed as the integration of the parcel arrival system and the constraint state. The agent is responsible for assigning a proper candidate route to each incoming parcel. At each time step, a parcel arrives and the agent takes an action $a_t$ (i.e., chooses one route from all candidate routes of the parcel). After the action is taken, the state $s_t$ transitions to state $s_{t+1}$ with probability 1, and the agent receives an immediate reward $r_t$. Finally, the agent waits for the next parcel to arrive and repeats the above process. Our goal is to obtain an agent with an optimal policy to allocate candidate routes for daily shipping parcels.

Parcel information and business goals. Parcel information consists of parcel attributes and candidate-route attributes. Parcel attributes are a parcel's own attributes, including i. the origin address, ii. the destination address, iii. parcel weight, iv. creation time, and v. parcel ID. Candidate-route attributes are the attributes of all candidate routes of a parcel. Consider a typical type of candidate route containing only two sections (i.e., first-mile and last-mile sections); then each route has its own candidate-route attributes, including i. the first-mile provider, ii. the last-mile provider, iii. the last-mile hub, and iv. cost.
The most commonly adopted business goals of route assignment for a fixed period are: i. minimize the total cost summed over the assigned routes; ii. ensure that the maximal capacity of each hub in a certain section is not exceeded, or not exceeded beyond a certain degree, where the capacity is the maximal number of parcels that can be stored; iii. ensure that the delivery proportion of each required provider in a certain section stays within given ranges under given OD pairs, where the proportion is the number of parcels transported by a certain provider relative to that of all required providers under the given OD pairs. In this study, we use a day as the fixed transportation period and the last-mile section as the required section.
These goals can be converted into an objective and constraints. Goal i is the objective of minimizing total cost. Goals ii and iii can be expressed respectively as:

Hub capacity constraint: the maximal (i.e., upper-bound) capacity for each hub in the last-mile section. We name it the Type-A constraint.

Route proportion constraint: the range of delivery proportions of each required provider in the last-mile section of a certain route. We name it the Type-B constraint.
Let $P$ be the set of parcels, $R_p$ the set of all candidate routes for parcel $p$, $C$ the set of all constraints, $R_c$ the routes corresponding to constraint $c$, $l_c$ and $u_c$ the lower and upper bounds of constraint $c$, and $\alpha_r$ and $\beta_r$ the lower- and upper-bound proportions of the Type-B constraint for route $r$. If all incoming parcels of a day were known in advance, the optimal solution could be obtained by solving the following mixed-integer programming (MIP) problem.
(1) $\min \sum_{p \in P} \sum_{r \in R_p} c_{p,r} \, x_{p,r}$

s.t. $\sum_{r \in R_p} x_{p,r} = 1, \; \forall p \in P; \quad l_c \le \sum_{p \in P} \sum_{r \in R_p \cap R_c} x_{p,r} \le u_c, \; \forall c \in C; \quad x_{p,r} \in \{0, 1\},$

where $x_{p,r}$ indicates whether route $r$ is assigned to parcel $p$ and $c_{p,r}$ is the corresponding cost. For a Type-A constraint, $l_c = 0$ and $u_c$ is the upper-bound capacity of hub $h$. For a Type-B constraint, $l_c = \alpha_r N_c$ and $u_c = \beta_r N_c$, where $N_c$ is the number of candidate routes corresponding to Type-B constraint $c$.
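To make the offline formulation concrete, the following is a minimal brute-force sketch of problem (1) on a toy instance with only hub capacity (Type-A) constraints. The data layout and function name are illustrative assumptions; a real deployment would hand the problem to a MIP solver such as SCIP rather than enumerate assignments.

```python
from itertools import product

def offline_optimal(parcels, capacity):
    """Brute-force the offline assignment: pick one route per parcel,
    minimize total cost, and respect per-hub capacity constraints."""
    best_cost, best_plan = float("inf"), None
    # Each parcel is a list of candidate routes, given as (hub, cost) pairs.
    for plan in product(*parcels):
        used = {}
        for hub, _ in plan:
            used[hub] = used.get(hub, 0) + 1
        if any(used[h] > capacity.get(h, 0) for h in used):
            continue  # this joint assignment violates a hub capacity bound
        cost = sum(c for _, c in plan)
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_cost, best_plan

# Two parcels with two candidate routes each; hub "H1" can hold one parcel.
parcels = [[("H1", 1.0), ("H2", 1.5)],
           [("H1", 1.2), ("H2", 1.4)]]
print(offline_optimal(parcels, {"H1": 1, "H2": 2}))
```

The enumeration is exponential in the number of parcels, which is exactly why the paper resorts to an online learned policy at realistic scales.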
However, future incoming parcels have non-Markovian characteristics and are almost unpredictable. We assume that the next incoming parcel is independent of the current constraint state and of previous parcels. Therefore, the next state of the environment is not completely determined by the previous state and action, which violates the MDP assumption of RL algorithms. To address this violation, we reconstruct the MDP dynamics of this constrained online route-assignment problem.
Let $H$ be the set of all hubs with Type-A constraints, $R^B$ the set of all routes with Type-B constraints, $M_h$ the maximal capacity of hub $h$, $m_{h,t}$ the remaining capacity of hub $h$ at time $t$, and $q_{r,t}$ the current proportion of the Type-B constraint for route $r$ at time $t$. The components of our reinforcement learning problem are then defined as follows:

State: the constraint satisfaction dynamics obey a Markovian state transition, i.e., $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)$. Hence, the state of the MDP is defined by the current constraint state $s_t$, which is composed of

the hub capacity state $m_t = (m_{h,t})_{h \in H}$;

the route proportion state $q_t = (q_{r,t})_{r \in R^B}$.

Action: the action is to choose (assign) one of the candidate routes for each parcel. The action depends on both the constraint state and the parcel information, i.e., $a_t = \pi(s_t, x_t)$.

Reward: the design of the reward is the most challenging part of the problem. At each time step, the immediate reward should integrate the constraint state and the parcel information. Since the objective is to minimize total cost, the first part of the reward is the cost $c_{a_t}$ of the assigned route, which depends on the action $a_t$. For Type-A constraints, the smaller the remaining capacity, the greater the penalty. For Type-B constraints, we give encouragement to bring the proportion closer to the lower bound if the current proportion is below it, and punishment if the current proportion exceeds the upper bound. Hence, the reward is designed as follows:
(2) $r_t = -c_{a_t} + \lambda \sum_{c \in C} f_c(s_t),$

where $\lambda$ is a hyperparameter balancing the importance of the constraint state function $f_c$ and the cost $c_{a_t}$. If constraint $c$ is a hub capacity constraint on hub $h$, then

(3) $f_c(s_t) = \frac{m_{h,t}}{M_h} - 1;$

if constraint $c$ is a route proportion constraint on route $r$, then

(4) $f_c(s_t) = \begin{cases} \alpha_r - q_{r,t}, & q_{r,t} < \alpha_r, \\ 0, & \alpha_r \le q_{r,t} \le \beta_r, \\ \beta_r - q_{r,t}, & q_{r,t} > \beta_r. \end{cases}$
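The reward design can be sketched for a single hub constraint and a single route-proportion constraint. The linear penalty and encouragement forms below are illustrative assumptions in the spirit of the description above, not the exact production formulas, and the value of the hyperparameter `lam` is arbitrary.

```python
def reward(route_cost, remaining, capacity, proportion, lower, upper, lam=0.1):
    """Illustrative immediate reward: negative route cost plus a
    lambda-weighted constraint term (assumed linear forms)."""
    # Type-A style term: penalty grows as remaining hub capacity shrinks
    # (0 when the hub is empty, -1 when it is full).
    type_a = remaining / capacity - 1.0
    # Type-B style term: encourage assignment when the current proportion
    # is below the lower bound, punish when it exceeds the upper bound.
    if proportion < lower:
        type_b = lower - proportion      # positive encouragement
    elif proportion > upper:
        type_b = upper - proportion      # negative punishment
    else:
        type_b = 0.0
    return -route_cost + lam * (type_a + type_b)

# Half-full hub, proportion below its lower bound of 0.2:
print(reward(1.0, remaining=50, capacity=100,
             proportion=0.1, lower=0.2, upper=0.4))
```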
DRL Algorithm
Proximal Policy Optimization (PPO) Schulman et al. (2017) is a commonly used RL algorithm with good practical performance. In this section, we describe how we apply and adapt PPO to the route-assignment task. As an Actor-Critic algorithm, PPO jointly or separately estimates the policy function $\pi_\theta$ and the state value function $V_\phi$ (often represented by an actor network and a critic network) during training. One common form of the advantage function, using the temporal-difference (TD) error for value-function estimation, is $\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$.
Due to the non-Markovian parcel arrival dynamics of the route-assignment task, we assume that incoming parcels are unpredictable; the next state value $V_\phi(s_{t+1})$ therefore cannot be estimated before the next parcel arrives. Moreover, the state value depends not only on the parcel information but also on the constraint state. Therefore, the advantage is defined as
(5) $\hat{A}_t = r_t - V_\phi(s_t, x_t),$

where $V_\phi(s_t, x_t)$ is the state value function represented by the critic network. The loss function for the critic network update is

(6) $L(\phi) = \mathbb{E}_t \left[ \left( r_t - V_\phi(s_t, x_t) \right)^2 \right].$
Feature Separation. We propose feature separation to handle the non-Markovian and Markovian attributes (i.e., parcel information $x_t$ and constraint state $s_t$) properly. A constraint table containing the latest global constraint-satisfaction information is maintained throughout training and testing, and is updated whenever a parcel is assigned. Furthermore, several on-the-fly preprocessing procedures, collectively called feature separation, convert the information of each incoming parcel (i.e., parcel attributes and candidate-route attributes) into valid inputs for the actor and critic networks. The preprocessed input contains two parts: parcel features and candidate-route features.
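The constraint table described above can be sketched as a small bookkeeping structure that is updated once per assigned parcel. The class name, layout, and proportion definition here are illustrative assumptions, not the paper's implementation.

```python
class ConstraintTable:
    """Minimal sketch of the global constraint table: tracks remaining
    hub capacities and delivery proportions of tracked routes."""

    def __init__(self, hub_capacity, tracked_routes):
        self.remaining = dict(hub_capacity)        # hub -> remaining capacity
        self.route_count = {r: 0 for r in tracked_routes}
        self.total = 0                             # parcels assigned so far

    def update(self, hub, route_id):
        """Call once after each parcel is assigned."""
        self.remaining[hub] -= 1
        if route_id in self.route_count:
            self.route_count[route_id] += 1
        self.total += 1

    def state(self, hub, route_id):
        """Constraint state queried for a candidate route of a new parcel."""
        prop = (self.route_count.get(route_id, 0) / self.total
                if self.total else 0.0)
        return self.remaining[hub], prop
```

A query such as `table.state("H1", "providerA-providerB")` would then feed directly into the candidate-route features of the next incoming parcel.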

Parcel features. Parcel features can be viewed as part of the parcel information $x_t$. All or some of the parcel attributes can be processed as parcel features. Here, we discretize the parcel weight, origin address, and destination address, and concatenate the results into a vector as the parcel features.

Candidate-route features. Candidate-route features can be viewed as a combination of part of the parcel information $x_t$ and the constraint state $s_t$. Let $K$ be the maximum possible number of candidate routes per parcel, and $k$ the number of candidate routes of a certain parcel to assign. Firstly, we reorder the candidate routes ascendingly by cost, so that the route with minimal cost is placed first. Then, we query the constraint table with several key elements (i.e., origin address, destination address, hubs, and last-mile providers) to obtain the constraint state relevant to this parcel. If $k < K$, the candidate-route features will contain $K - k$ dummy routes with default costs and default constraint states, all of whose values are set to 0. Finally, the candidate-route features are formed by concatenating the route vectors as $(v_1, \dots, v_K)$, where $v_i$ is the route vector of the $i$-th reordered real or dummy candidate route together with its corresponding cost, $i = 1, \dots, K$.
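The reorder-and-pad preprocessing can be sketched as follows, with a hypothetical data layout: each route is a `(cost, feature_list)` pair, and a parallel 0/1 mask marks real versus dummy slots.

```python
def build_route_features(routes, max_routes):
    """Sketch of candidate-route preprocessing: sort candidates by cost
    ascending, then pad with all-zero dummy routes up to width K."""
    routes = sorted(routes, key=lambda r: r[0])   # cheapest route first
    mask = [1] * len(routes) + [0] * (max_routes - len(routes))
    dummy = (0.0, [0.0] * len(routes[0][1])) if routes else (0.0, [])
    padded = routes + [dummy] * (max_routes - len(routes))
    return padded, mask

routes = [(2.5, [1.0, 0.3]), (1.2, [0.0, 0.9])]
padded, mask = build_route_features(routes, max_routes=4)
# The cheapest route (cost 1.2) now comes first; the two dummy
# slots at the end are marked 0 in the mask.
```

The mask produced here is what the actor network later uses to zero out the probability of dummy routes.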
Actor network. As shown in Figure 3, the parcel features and candidate-route features are fed into separate embedding layers followed by Multi-Layer Perceptrons (MLPs). We use several neural network branches to receive the route vectors in the candidate-route features, with all branches sharing identical parameters. Since route vectors of dummy routes may exist, we add a mask that converts the dot-product-and-concat outputs of dummy routes to negative infinity, so that the probabilities output for these dummy routes by the Softmax layer are always 0.
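The masking step can be illustrated in isolation: scores of dummy routes are set to negative infinity before the softmax, which assigns them exactly zero probability. This is a generic sketch, independent of the network details.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over route scores; dummy slots (mask == 0) are pushed
    to -inf so their probability is exactly zero."""
    masked = [s if m else -math.inf for s, m in zip(scores, mask)]
    peak = max(masked)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Third slot is a dummy route: it gets probability 0.
print(masked_softmax([1.0, 2.0, 0.5], [1, 1, 0]))
```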
Critic network. The critic network in Figure 4 is similar to the actor network. Parameter sharing is again utilized to accommodate the candidate-route features, and an attention mechanism Vaswani et al. (2017) is employed to calculate the state value for each incoming parcel. An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. We treat the parcel features as the query, and the candidate-route features are used to generate the keys and values:

(7) $Q = W^Q e_p, \quad K_i = W^K v_i, \quad V_i = W^V v_i,$

where $e_p$ is the embedded parcel feature and $W^Q$, $W^K$, $W^V$ are learned projection matrices. Then, the weighted sum is computed by

(8) $z = \sum_i \mathrm{softmax}_i\!\left( \frac{Q K_i^\top}{\sqrt{d}} \right) V_i,$

where $d$ is the embedding dimension. Finally, the state value can be obtained by

(9) $V_\phi(s_t, x_t) = \mathrm{MLP}(z).$
Therefore, the improved clipped optimization objective for policy updating is

(10) $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( \rho_t(\theta) \hat{A}_t, \; \mathrm{clip}\left( \rho_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right],$

where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t, x_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t, x_t)$ is the probability ratio and $\epsilon$ is the clipping parameter.
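The per-sample clipped surrogate can be sketched directly. This is the standard PPO clipping term; the paper's $\epsilon$ value is not specified here, and 0.2 is merely a common default.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized): the minimum
    of the unclipped and clipped ratio-weighted advantage."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
print(ppo_clip_term(1.5, 1.0))
# Negative advantage: clipping keeps the pessimistic (smaller) value.
print(ppo_clip_term(0.5, -1.0))
```

Taking the minimum makes the objective a pessimistic bound, which is what discourages destructively large policy updates.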
Our complete DRL algorithm is presented in Alg. 1. Its inputs are the initial policy parameters $\theta$ and initial value function parameters $\phi$. Trajectories are collected in parallel through the policy $\pi_\theta$ (line 2), and the network parameters $\theta$ and $\phi$ are then updated on the collected trajectories, typically via stochastic gradient descent with Adam.
Performance Evaluation
Algorithm Implementation.
We implement and evaluate PPO-RA on a workstation (Ubuntu 16.04) equipped with an Intel Xeon Platinum 8163 @ 2.50 GHz, 32 GB of memory, and an Nvidia Tesla V100 GPU with 16 GB of memory. We use PyTorch Paszke et al. (2019) for the implementation. For the neural network settings, the embedding dimensions are 64 in both the actor and critic networks. In the actor network, the MLP for parcel features has a single layer of 128 neurons with ReLU activation, and the MLP for candidate-route features has two layers: a layer of 256 neurons with ReLU activation followed by a linear layer of 128 neurons. The MLPs of the critic network have the same settings as the actor network; after the masked-attention part, the last MLP has a layer of 64 neurons with Sigmoid activation followed by a linear layer with 1 neuron. Learning rates for the actor and critic networks are set separately, and the hyperparameter $\lambda$ is set separately for Type-A and Type-B constraints.

Dataset. We use datasets from multiple countries recorded by Cainiao Network. A dataset contains two parts: parcel data and constraint configuration data.

Parcel data. This part contains historical parcels delivered within a certain country over multiple days, sorted by creation time. Each row is the parcel information introduced in the Problem Modeling section.

Constraint configuration data. This part contains the constraint configurations for a certain country, such as hub capacity constraints and route proportion constraints.
The datasets of country #1 and country #2 are used for training and testing; we denote countries by codes for confidentiality. Dataset #1 contains only 625 hub capacity constraints, and the number of daily created parcels to assign ranges from 567,429 to 806,824. Dataset #2 contains only 51 route proportion constraints, and the number of daily created parcels ranges from 293,208 to 326,332.
In the training procedure, we select parcels of a certain country (#1 or #2) created within day $T$ from the parcel data and feed them into the simulation for trajectory collection. The simulation sends the parcels to the agent in a certain order (e.g., sorted by creation time) for route assignment. Training converges after about 20 iterations, which takes about 10 hours. In each iteration, we first collect 50 trajectories in parallel and put all MDP tuples of the trajectories into a list, then shuffle the list and update the parameters of the actor and critic networks with Adam using that list. The minibatch size for gradient descent is 2048.
In the testing procedure, we select parcels of the same country (#1 or #2) created within the next three days (i.e., T+1, T+2, T+3) for testing.
Baselines. We compare PPO-RA with three baseline methods applied in the logistics industry:

MIP: the online route-assignment problem can be reformulated as the mixed-integer programming (MIP) problem (1) if all parcels to assign are known, and its solution is the optimal solution of the online route-assignment problem. We therefore adopt the MIP gap to measure the performance difference between MIP and a given algorithm (PPO-RA and others). We use SCIP Achterberg (2009), a commonly used solver, to obtain MIP solutions.

Proportion: an effective algorithm commonly used for the online route-assignment task, which relies on MIP solutions. First, a MIP solution for historical delivered-parcel data over a period of time (e.g., one month) is computed to obtain the assignment probability of each last-mile provider at the OD-pair and discretized-weight-category level; this probability is called the offline assignment probability. When applied to online assignment, the Proportion algorithm queries the offline assignment probability of the last-mile provider in each candidate route of an incoming parcel. Then, roulette wheel selection is used to assign a route to that parcel, so the candidate route with the highest offline assignment probability tends to be assigned. However, the Proportion algorithm's solutions almost always deviate from the optimal ones, since parcel information varies from day to day.
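The roulette-wheel step of the Proportion baseline can be sketched generically; this is a textbook version of the selection scheme, not Cainiao's implementation, and the function name is illustrative.

```python
import random

def roulette_select(candidates, probs, rng=random):
    """Draw one candidate with probability proportional to its weight:
    spin a point on [0, total) and walk the cumulative sums."""
    total = sum(probs)
    pick = rng.random() * total
    cum = 0.0
    for cand, p in zip(candidates, probs):
        cum += p
        if pick <= cum:
            return cand
    return candidates[-1]  # guard against floating-point round-off

# A route with zero offline probability is never chosen.
print(roulette_select(["route_a", "route_b"], [1.0, 0.0]))
```

Routes with higher offline assignment probability are drawn more often, which is why the baseline tracks the historical MIP proportions on average but drifts when daily parcel mixes change.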

Greedy: the greedy algorithm does not produce optimal solutions for many problems, but can yield locally optimal solutions near the global optimum within a reasonable amount of solving time. If there were no constraints, the greedy algorithm would achieve the optimal solution, since the objective is to minimize total cost. For the online route-assignment problem, a greedy algorithm that assigns the candidate route with the smallest cost to each parcel serves as an online baseline.
Accordingly, performance metrics are:

average cost: the total cost of assigned parcels divided by the number of assigned parcels. For confidentiality, we linearly transform the average cost to another number.

MIP gap: the difference between the average cost of the MIP solution and that of the compared algorithm's solution, divided by the average cost of the MIP solution.

constraint violation rate: the number of parcels with constraint violations after route assignment divided by the total number of parcels. Notably, the MIP solution has a zero violation rate for hub capacity constraints and less than 1% for route proportion constraints, since it is the optimal solution computed offline.
Evaluation results. Tables 1 and 2 show the average cost per parcel and the MIP gap achieved by PPO-RA and the Proportion algorithm on datasets #1 and #2. PPO-RA achieves about 0.2–0.3% lower cost than the Proportion algorithm. For the constraint violation rates shown in Figures 5 and 6, PPO-RA has almost the same violation rates of hub capacity constraints as the Proportion algorithm, and its violation rates of route proportion constraints are lower than those of the Proportion and Greedy algorithms. In summary, PPO-RA obtains lower average parcel cost (i.e., closer to the MIP solutions) without higher constraint violation rates than the compared baselines.
Table 1. Average cost and MIP gap on dataset #1.

Day   Algorithm    Average Cost   MIP Gap
T+1   PPO-RA       100.73         0.0706%
T+1   Proportion   100.05         0.3843%
T+1   MIP          100.66         0%
T+2   PPO-RA       99.789         0.0671%
T+2   Proportion   100.19         0.4707%
T+2   MIP          99.719         0%
T+3   PPO-RA       98.482         0.0862%
T+3   Proportion   98.927         0.5403%
T+3   MIP          98.396         0%
Table 2. Average cost and MIP gap on dataset #2.

Day   Algorithm    Average Cost   MIP Gap
T+1   PPO-RA       81.193         0.1277%
T+1   Proportion   81.459         0.1968%
T+1   MIP          81.297         0%
T+2   PPO-RA       78.566         0.0495%
T+2   Proportion   78.723         0.2512%
T+2   MIP          78.523         0%
T+3   PPO-RA       84.755         0.1213%
T+3   Proportion   84.930         0.3307%
T+3   MIP          84.693         0%
Conclusion
This study focuses on a typical online decision-making problem, the constrained online logistics route-assignment problem, which aims to assign a proper logistics route to each incoming parcel so as to optimize a certain objective (e.g., minimizing total logistics cost) while satisfying most business constraints, such as hub capacity constraints and delivery-proportion constraints on delivery providers. The problem poses several challenges, including the large number of daily parcels to assign, the variability of the number and attributes of candidate routes for each parcel, and the non-Markovian characteristics of parcel arrival dynamics. We propose a model-free DRL approach named PPO-RA to address these challenges, in which Proximal Policy Optimization (PPO) is improved with techniques dedicated to constrained online route-assignment (RA) tasks. The actor and critic networks adopt an attention mechanism and parameter sharing to accommodate incoming parcels with varying numbers and identities of candidate routes, without modeling the non-Markovian parcel arrival dynamics, since we assume i.i.d. parcel arrivals. Using datasets of delivered parcels from multiple countries recorded by Cainiao Network, the proposed approach is validated against baselines commonly used in the logistics industry. The results are promising: in the majority of cases, PPO-RA obtains considerably greater cost reduction while violating fewer constraints.
References
T. Achterberg (2009). SCIP: solving constraint integer programs. Mathematical Programming Computation 1(1), pp. 1–41.
N. Buchbinder and J. Naor (2009). The design of competitive online algorithms via a primal-dual approach. Now Publishers Inc.
N. R. Devanur et al. (2011). Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In Proceedings of the 12th ACM Conference on Electronic Commerce, pp. 29–38.
B. Du, C. Wu, and Z. Huang (2019). Learning resource allocation and pricing for cloud profit maximization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7570–7577.
M. Hessel et al. (2018). Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
V. R. Konda and J. N. Tsitsiklis (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014.
T. P. Lillicrap et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani (2007). Adwords and generalized online matching. Journal of the ACM 54(5), pp. 22–es.
V. Mnih et al. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
V. Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
A. Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026–8037.
J. Schulman et al. (2015). Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
D. Silver et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
D. Silver et al. (2017b). Mastering the game of Go without human knowledge. Nature 550(7676), pp. 354–359.
G. Tesauro et al. (2006). A hybrid reinforcement learning approach to autonomic resource allocation. In 2006 IEEE International Conference on Autonomic Computing, pp. 65–73.
G. Tesauro (2005). Online resource allocation using decompositional reinforcement learning. In AAAI, Vol. 5, pp. 886–891.
A. Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Z. Xu et al. (2017). A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs. In 2017 IEEE International Conference on Communications (ICC), pp. 1–6.
H. Ye, G. Y. Li, and B.-H. F. Juang (2019). Deep reinforcement learning based resource allocation for V2V communications. IEEE Transactions on Vehicular Technology 68(4), pp. 3163–3173.
W. Zhang and T. G. Dietterich (1995). A reinforcement learning approach to job-shop scheduling. In IJCAI, Vol. 95, pp. 1114–1120.