Deep Q-Learning for Same-Day Delivery with a Heterogeneous Fleet of Vehicles and Drones

10/25/2019 ∙ by Xinwei Chen, et al.

In this paper, we consider same-day delivery with a heterogeneous fleet of vehicles and drones. Customers make delivery requests over the course of the day, and the dispatcher dynamically dispatches vehicles and drones to deliver the goods to customers before their delivery deadline. Vehicles can deliver multiple packages in one route but travel relatively slowly due to urban traffic. Drones travel faster, but they have limited capacity and require charging or battery swaps. To exploit the different strengths of the fleets, we propose a deep Q-learning approach. Our method learns the value of assigning a new customer to either drones or vehicles as well as the option to not offer service at all. To aid feature selection, we present an analysis that demonstrates the role that different types of information play in the value function and in decision making. In a systematic computational analysis, we show the superiority of our policy compared to benchmark policies and the effectiveness of our deep Q-learning approach.




1 Introduction

Same-day delivery (SDD) changes the way people shop as it combines immediate product availability and the convenience of ordering from electronic devices (2014same). Because of its attractive nature, SDD is expected to reach 15% of last-mile delivery volumes as soon as 2020 (joerss). Retailers have taken note: US retailers Amazon, Target, and Walmart are racing to expand their SDD options, and Target announced in June 2019 that it will offer same-day delivery in 47 states (thomas2019). The market growth is also leading retailers new to SDD to enter the market. In April 2019, CVS Pharmacy announced same-day delivery of prescriptions to patients in the U.S. (cvs), and the grocery retailer Hy-Vee recently revealed its interest in same-day alcohol delivery in the Omaha-Lincoln metro area, Nebraska (hyvee).

While consumers seek and retailers are eager to provide SDD, it is not without its challenges. The timing of requests and the delivery locations are not known until a customer places an order. Further, because of the need to meet tight delivery deadlines, consolidation opportunities are scarce, rendering the use of conventional delivery vehicles (hereafter referred to as “vehicles”) inefficient, especially for delivery in less dense areas of the city. In addition, vehicles are slowed by congestion on urban streets. As an alternative, companies have begun to complement vehicles with Unmanned Aerial Vehicles (hereafter referred to as “drones”). In 2016, Amazon Prime Air made its first drone delivery in Cambridgeshire, England (kim2016). In May 2019, Alphabet Inc., Google’s parent company, announced its subsidiary Wing would launch drone deliveries in Finland starting in June 2019 (pero). Wing has also been approved for commercial drone deliveries in the United States and Australia. In September 2019, Wing announced it would begin to test drone deliveries in Christiansburg, Virginia, teaming up with FedEx and Walgreens to provide deliveries of customer packages and over-the-counter medicines as well as food and beverages (bhagat_2019).

For SDD, drones provide fast and direct deliveries as they are not required to follow urban road networks. However, existing drones can usually carry only one item at a time and require regular charging or battery swaps. As a result, drones may not entirely replace vehicles in last-mile delivery, especially when the volume of customer requests is high (wang2016). Further, recent studies show that there are benefits to combining fleets of vehicles and drones for SDD (drones). The challenge, however, is how to effectively exploit the strengths of the individual vehicle types (e.g., capacity, speed) to offer service to as many customers as possible per day.

In this paper, we address this challenge for the same-day delivery problem with a heterogeneous fleet of vehicles and drones (SDDPHF). In this problem, over the course of a day, vehicles and drones deliver goods from a depot to customers and then return to the depot for future dispatches. The vehicles and drones differ in their travel speeds, capacities, and the need for charging or battery swaps. Customers are unknown until they make requests, and each request is associated with a delivery deadline. For every request, the dispatcher must determine whether the request is accepted and, if so, whether a vehicle or drone will make the delivery. All accepted requests must be delivered within a certain period of time. The objective is to maximize the expected number of customers served.

This problem is complex because each decision impacts the availability of drones and vehicles to serve future requests. Moreover, because customers are waiting for a response, decisions need to be made instantaneously. To overcome these challenges, we propose a deep Q-learning approach. Q-learning is a form of reinforcement learning that seeks to learn the value of state-action pairs. Deep Q-learning uses a deep neural network (NN) as the approximation architecture. Because the NN can be trained offline, the method can be employed for real-time decision making.
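To make the idea of learning state-action values concrete, the following is a minimal tabular sketch of a single Q-learning update. It is not the deep Q-network used in this paper: the state labels, the action labels ("vehicle", "drone", "reject"), the learning rate `alpha`, and the discount factor `gamma` are all illustrative assumptions.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=1.0):
    """One tabular Q-learning step.

    Moves Q(state, action) toward reward + gamma * max_a' Q(next_state, a').
    """
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy transition: serving a request by drone earns reward 1.
q = defaultdict(float)  # unseen state-action pairs start at value 0
value = q_update(q, "s0", "drone", 1.0, "s1", ["vehicle", "drone", "reject"])
```

In deep Q-learning, the lookup table `q` is replaced by a neural network that maps features of the state and action to an estimated value.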

A key step in any reinforcement learning approach is feature selection. Features are the input to the approximation. A large number of features increases the dimensionality of the NN and can increase training time and approximation error, while a small number of features may fail to represent states adequately, resulting in poor decision making. To identify appropriate features for our problem, we analytically identify elements of the state that affect the shape of the value function as well as those that influence decisions about offering service. We then computationally demonstrate that the combination of these features provides superior decision making.

We compare the proposed approach to high-quality benchmark policies from the literature. We also develop new methods for this paper that seek to improve the quality of the benchmark from the literature by adding new features to it. Our computational results demonstrate that the proposed Q-learning approach provides higher quality solutions than all of the benchmarks.

This research makes several important contributions to the literature. It presents an effective and fast approach to an important and emerging delivery problem. It is among the first papers to implement deep Q-learning techniques for same-day delivery and for dynamic routing problems in general. Our work highlights the potential of reinforcement learning techniques in dynamic vehicle routing. In addition, we show the value of including features that reflect both resource utilization and the impact of action selection in the current state. The identification of these categories of features offers general guidance for feature selection in other dynamic routing problems.

The paper is organized as follows. In Section 2, we present the literature related to SDD and reinforcement learning for routing-related problems. Section 3 describes the problem, and Section 4 presents a Markov decision process model of the problem. In Section 5, we introduce the deep Q-learning solution approach for the SDDPHF. Section 6 characterizes the structure of the value function as well as action selection and uses these results to identify features used for the Q-learning approach. Section 7 describes the details of instances, implementation, and the benchmark policies and presents the results of our computational study. Section 8 closes the paper with conclusions and a discussion of future work.

2 Literature Review

In this section, we present the literature related to the SDDPHF. We first review the literature related to SDD and then present the existing applications of reinforcement learning in vehicle routing problems (VRP). For a general review of drone routing problems, we refer the reader to otto. It is worth noting that most of the papers cited by otto do not consider the dynamism of the SDDPHF. In contrast to the work in this paper, most of the literature involving drone delivery assumes that the customers to be served are known a priori.

2.1 Same-day Delivery

The existing literature on SDD is limited but is growing as the service becomes more popular. The most closely related work is found in drones. Similar to this paper, drones consider the SDD with a heterogeneous fleet of vehicles and drones. Their work is the first to consider a heterogeneous fleet in the context of SDD. It is also the first to investigate the impact of adding drones to a fleet of conventional vehicles for a dynamic routing problem. The authors introduce a parametric policy function approximation (PFA) approach to heuristically solve the assignment portion of the problem. They use a fixed travel time threshold to determine whether a customer should be served by a drone or a vehicle.

The PFA policy presented in drones uses only the location of customers to decide whether to use a vehicle or drone to serve a given customer. Yet, there is considerably more available information that might help produce better decisions. This paper extends the work in drones and uses not only the location, but also other available information reflecting resources and demand. We demonstrate that this additional information greatly improves the performance of the PFA and other benchmarks. To do so, we turn to a different type of approximation, Q-learning, which approximates the value of state-action pairs and which we implement using an NN.

Also related is the work of liu in which the author considers on-demand meal delivery using drones. In the paper, the dispatcher dynamically assigns drones of different capacities to pick up food from restaurants and deliver it to customers. The author first introduces a mixed-integer-programming model to minimize the total lateness for the static version of the problem and then uses heuristics to solve the dynamic problem. The problem is similar to the SDDPHF in that both consider the dynamism of customer orders and involve routing with drones. However, in the dynamic case, liu presents a rolling-horizon approach, an approach that ignores the future when making current decisions. Our approach uses deep Q-learning to incorporate each decision’s current value as well as its effect on future value. Further, in addition to an assignment (routing) decision, the dispatcher in the SDDPHF must also make an acceptance decision for each customer request, which makes the action space in the problem even larger than that in the problem studied by liu. Finally, the SDDPHF requires the routing of vehicles, further complicating assignment decisions.

grippa also study a version of the SDD in which deliveries are made by a fleet of drones. The authors consider a system of drones that deliver goods from the depot to customers and model it as a queueing problem. The performance of policies with different heuristics is evaluated computationally. Because of the drones’ interaction with the vehicles and the possibility of consolidating multiple customers’ packages onto vehicles, the queueing approach proposed in grippa does not apply to the problem in this work.

Other literature in SDD presents anticipatory methods for single-vehicle problems. klapp31 consider a dynamic routing problem of a vehicle traveling on a line, where probabilistic information is used to anticipate future requests. klapp32 use an a priori-based rollout policy to determine the customers to serve and whether the vehicle should leave the depot for deliveries at the current time or wait for more customer orders to come in. ulmer45 consider an SDD problem in which the vehicle is allowed to return to the depot preemptively. The authors introduce an approximate dynamic programming (ADP) approach, where at each customer (or at the depot) a decision about whom the vehicle visits next is made using the information in the state. In contrast to klapp31, klapp32, and ulmer45, this paper focuses on not only multiple vehicles but a heterogeneous fleet. The methods proposed in klapp31, klapp32, and ulmer45 do not scale to the problem discussed in this paper.

azi and voccia consider using multiple, but homogeneous vehicles in dynamic routing problems. These papers solve the problems using multiple-scenario approaches (MSA). While effective, MSA requires real-time computation that can be challenging in the context of the large fleet and the many incoming requests that we consider in this paper. Our proposed method uses offline computation to learn the approximation and can provide instantaneous solutions in real-time.

dayarian consider the SDD with drone resupply, where a vehicle performs the actual deliveries of goods to customers, and a drone resupplies goods from the depot to the vehicle en route. The use of the drone resupply enables the vehicle to serve a new set of customers without the need of returning to the depot to pick up the goods, and thus more customers can be served before their delivery deadlines. ulmer_station consider the SDD with a fleet of autonomous vehicles, where the autonomous vehicles deliver the ordered goods from the depot to a set of pick-up stations. The authors introduce a PFA approach to minimize the expected sum of delivery times. The emerging business models studied in dayarian and ulmer_station complement the work in this paper.

2.2 Reinforcement Learning for VRP

In this section, we focus only on the literature that uses reinforcement learning for VRPs. We do not consider supervised learning approaches such as those discussed in potvin, vinyals2015, and fagerlund2018. Reinforcement learning (RL) is an area of machine learning that is often applied to Markov decision processes for problems in robotics, artificial intelligence, and signal control. sutton2018reinforcement provide a general overview. There are two common types of RL algorithms: policy-based methods and value-based methods. Policy-based learning algorithms seek to optimize a policy directly. Value-based learning algorithms learn the values of being in particular states or of particular state-action pairs. We specifically use a deep Q-learning network (DQN). DQN was introduced by atari who demonstrate its ability to play Atari games with super-human performance.

There are a few papers presenting RL work in dynamic vehicle routing, for example, toriello2014dynamic, perez2017, and ulmer2017budgeting. For a recent overview, we refer to meso. This work usually draws on the concept of value-function approximation (VFA). VFAs approximate the value of post-decision states by means of simulations. The values are stored in approximation architectures, usually either functions or lookup tables. The main difference with Q-learning is that Q-learning considers the value of state-decision pairs rather than just the value of a post-decision state. We show that, for the SDDPHF at least, Q-learning’s ability to take advantage of both state- and action-space information has an advantage over relying on just the information available in the post-decision state.

Work on using NNs in reinforcement learning for VRPs is particularly scarce. madziuk provides an overview of recent advances in the VRP literature over the last few years, where the author points out that “it came as a surprise to the author that NNs have practically not been utilized in the VRP domain …."

Among the papers using RL with NNs applied to dynamic routing problems, the most closely related to this paper is chen2019. chen2019 introduce an actor-critic framework, a policy-based RL method, for the problem of making pick-ups at customers who make dynamic requests for service. The problem is similar to the SDDPHF in that both consider dynamic requests and customer locations that are unknown in advance. The papers differ in that chen2019 learn a policy for a single vehicle and then apply this policy to all vehicles. As a result, the policy that is learned does not account for the interaction among the vehicles. Our results show that accounting for this interaction leads to superior solution quality. In addition, chen2019 do not make decisions on whether or not to offer service to customers, but rather they allow unserved customer requests to expire.

In their appendix, reza discuss a problem in which a single vehicle serves dynamic requests. Requests not served in a specified time period are lost. The vehicle has a limited amount of inventory and must return to the depot to replenish once the inventory is depleted. Like chen2019, reza propose a policy-gradient approach. The problem studied by reza is different and has a much smaller state and action space than the problem studied in this paper. Likewise, kool use a policy-gradient approach to solve the stochastic prize-collecting traveling salesman problem and also present an application to a variant in which customer locations and demands can change. Again, given the different problems, the commonality between kool and this paper is the use of RL. Yet, the use of policy-gradient methods in chen2019, reza, and kool suggests a future opportunity to explore the value of policy-gradient versus Q-learning approaches for dynamic vehicle routing.

3 Problem Description

In this section, we present a formal description of the SDDPHF. Due to similarities in the problems, this problem description is similar to that found in drones.

Over the course of an operating period, a fleet of vehicles and drones deliver goods from a depot located in some area to customers. Customers in the area make dynamic requests for service. The location of the th customer is unknown until the request is made as is the time of the request. Every delivery request has a hard deadline . In other words, the delivery must be completed within units of time after accepted at . Once receiving a customer request, the dispatcher immediately decides whether to offer the service. Before doing so, the dispatcher first checks the feasibility of serving the request by vehicles or drones. A customer request is infeasible if none of the vehicles and drones can complete the delivery by its deadline. The request can otherwise be feasibly served. Infeasible requests are automatically denied service and then ignored thereafter. In addition, if a request can be feasibly served, the dispatcher must determine whether to offer service. If a customer request is accepted for service, the dispatcher needs to assign it to a vehicle or drone in the fleet to make the delivery. The objective is to maximize the expected number of customers served during the operating period. This objective reflects our desire to serve as many customers as possible on the premise that doing so generates revenue now but also goodwill that leads to future purchases. We assume the driver costs are fixed and omit them from our objective because the amount of time available for work is fixed.

3.1 Heterogeneous Fleet

In the SDDPHF, we consider a heterogeneous fleet of vehicles and drones that differ in their characteristics. First, vehicles and drones have different capacities. Vehicles can carry multiple packages so they can make deliveries to multiple customers in a route. Because drones can carry only one package at a time, they must return to the depot after each delivery. Vehicles are uncapacitated due to the small size of most of the delivery items and relatively low number of packages on most routes (amazon). The different capacities result in different loading times for vehicles and drones. It takes units of time to load a package onto a drone and units of time for a drone to drop off a package at a customer. For vehicles, a constant loading time is used regardless of the number of packages, and the drop off time at a customer is . Due to working hour regulations, vehicles have to return to the depot before . We assume a (potentially different) latest return time for drones of , for example, at time the warehouse closes. Thus, the operation period is .

In addition to their capacities, vehicles and drones also differ in their travel networks and speeds. Vehicles must follow the street network. Drones can travel between the depot and a customer in the Euclidean plane and travel at a faster speed than the vehicles. Thus, different functions are used to determine the travel time of vehicles and drones. The travel time between two points for a vehicle is given by and for a drone given by .
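As a rough illustration of two different travel-time functions, the sketch below uses a Manhattan (grid) metric for vehicles and a Euclidean metric for drones. The grid metric, the 20 minutes per unit for vehicles, and the drone flying at twice the vehicle speed are assumptions borrowed from the illustrative example in Section 3.3. The function names are hypothetical, and an actual street network would need a different vehicle function.

```python
import math


def vehicle_travel_time(a, b, minutes_per_unit=20.0):
    """Vehicle travel time in minutes, assuming a Manhattan-style grid."""
    return (abs(a[0] - b[0]) + abs(a[1] - b[1])) * minutes_per_unit


def drone_travel_time(a, b, minutes_per_unit=10.0):
    """Drone travel time in minutes, assuming straight-line (Euclidean) flight."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) * minutes_per_unit


depot, customer = (0.0, 0.0), (3.0, 4.0)
print(vehicle_travel_time(depot, customer))  # grid distance 7 units
print(drone_travel_time(depot, customer))    # straight-line distance 5 units
```

Under these assumptions the drone gains twice: it covers a shorter distance and it does so at a higher speed.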

In contrast to vehicles, drones also have a limited battery capacity and require battery swapping or charging after a delivery trip. To recognize the charging need, we assume units of charging time (or time to swap the battery) are required for drones whenever they return to the depot from a customer. It is assumed that the fresh battery level is sufficient for delivering a package to any customer in the delivery area.
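The timing of a single drone trip, including the charging requirement, can be sketched as follows. The default loading, drop-off, and charging durations are taken from the illustrative example in Section 3.3; the function name and argument names are hypothetical.

```python
def drone_trip(load_start, one_way_travel,
               load_time=10.0, drop_time=10.0, charge_time=20.0):
    """Return (arrival at customer, return to depot, ready for next loading).

    All times are in minutes. The drone loads one package, flies out,
    drops it off, flies back, and then charges or swaps its battery.
    """
    arrival = load_start + load_time + one_way_travel
    back_at_depot = arrival + drop_time + one_way_travel
    ready = back_at_depot + charge_time
    return arrival, back_at_depot, ready


arrival, back, ready = drone_trip(load_start=0.0, one_way_travel=15.0)
```

Because a drone carries one package per trip, every customer it serves costs a full cycle of this kind before the drone becomes available again.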

Drones are also subject to weight limits. However, in this paper, we assume all the goods that customers request are under that weight limit. It has been shown that 86% of the products Amazon delivers weigh 5 pounds or less (amazon), and existing drones can easily carry packages weighing up to about 5 pounds over distances of up to 12.5 miles (ups_drone).

3.2 Assignments and Routing

Once a customer request is revealed, the dispatcher decides whether to offer the service. Only feasible requests can be accepted. If a feasible request is accepted, the dispatcher then decides whether to make this delivery by a vehicle or drone. To assign the package to a vehicle, the dispatcher must determine which vehicle and where in the selected vehicle’s route to insert the new request. Similarly, if the package is assigned to a drone, the dispatcher must also decide which drone. As a result, the action space of the problem is huge. We assume that the processes of preparing a parcel for delivery by drone and by vehicle differ. Thus, the assignments of requests to a fleet type are permanent once made.

A vehicle’s (or drone’s) route becomes fixed once the vehicle (or drone) leaves the depot to make deliveries to the customers (customer) in this route. The vehicle must deliver all the loaded packages before returning to the depot. No pre-emptive returns to the depot are allowed. A vehicle’s (or drone’s) route that has not started is called a planned route and is subject to change in the case new customers are assigned to it. To this end, the dispatcher maintains and updates a set of planned routes for all the vehicles and a set of planned routes for all the drones in the fleet. Then, the set of all planned routes is denoted as .

3.2.1 A Vehicle’s Planned Route

For vehicle , its planned route contains the depot visits and a sequence of customers that are planned to be serviced by vehicle and is represented by

The first entry of a planned route represents the vehicle ’s next depot visit , the arrival time at the depot , and the time at which the vehicle starts to load packages for the assigned customers in the next tour. The difference between and is the time the vehicle spends waiting at the depot before it starts its next delivery tour. Following the first depot visit is a sequence of customers , , that are assigned to vehicle but not yet loaded. The last entry in a planned route represents the vehicle’s return to the depot at time .

Each customer in a planned route is associated with a location and a planned arrival time . A planned route is feasible if all arrival times at customers are not later than their deadlines. The differences between arrival times reflect loading, service, and travel times. In the SDDPHF, waiting at a customer is not allowed. It is worth noting that can never exceed because we enforce feasibility. When the depot visit occurs, the vehicle idles until a new tour is scheduled. If no new customers are assigned to the vehicle before the end of the service period, then . If the dispatcher decides to serve new requests using the vehicle, the dispatcher updates the vehicle’s route plan to integrate the new customers. This update will change arrival and departure times.
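The way arrival times follow from loading, travel, and service times can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: travel uses a Manhattan grid at 20 minutes per segment and the loading and service times are 10 minutes, as in the example of Section 3.3. All function names are hypothetical.

```python
def manhattan_minutes(a, b, minutes_per_segment=20.0):
    """Assumed grid travel time between two points, in minutes."""
    return (abs(a[0] - b[0]) + abs(a[1] - b[1])) * minutes_per_segment


def schedule_vehicle_route(load_start, customers, depot=(0, 0),
                           load_time=10.0, service_time=10.0):
    """Compute planned arrival times and the depot return time.

    customers is a list of (location, deadline) pairs in visiting order.
    No waiting at customers: the vehicle leaves right after service.
    """
    arrivals = []
    clock = load_start + load_time  # one constant loading time per tour
    position = depot
    for location, _deadline in customers:
        clock += manhattan_minutes(position, location)
        arrivals.append(clock)
        clock += service_time
        position = location
    return arrivals, clock + manhattan_minutes(position, depot)


def is_feasible(load_start, customers, shift_end):
    """A route is feasible if every arrival meets its deadline and the
    vehicle is back at the depot by the end of the shift."""
    arrivals, back = schedule_vehicle_route(load_start, customers)
    on_time = all(t <= d for t, (_, d) in zip(arrivals, customers))
    return on_time and back <= shift_end


route = [((1, 0), 100.0), ((1, 1), 100.0)]
arrivals, back = schedule_vehicle_route(0.0, route)
```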

3.2.2 A Drone’s Planned Route

A planned route for a drone is slightly different from that for a vehicle in that a customer entry must be between two depot entries due to the capacity of drones. For a drone , its planned route is represented by

Because there can be more than two depot entries in a drone’s planned route, we use index to represent the last entry, the second to last and so on. The difference between and represents the drone ’s charging time after the previous delivery and the time the drone spends on waiting at the depot before it starts to load the items for its next delivery tour. Finally, the waiting time is set to .

3.3 Illustrative Example

In Figure 1, we illustrate the SDDPHF for (in minutes), an hour after the shift begins. At this time, we assume that the dispatcher receives a new customer request . The two panels in the figure describe the corresponding pre-decision and post-decision states, which are formally described in Section 4. In this example, the depot is located in the center of the area. The fleet consists of a vehicle and a drone. We assume the vehicle travels on a Manhattan-style grid and needs 20 minutes to travel a segment. The drone travels in the Euclidean plane at twice the speed of the vehicle. Customer orders must be completed within 240 minutes after they are accepted. For both the vehicle and the drone, the loading time at the depot and the delivery time at a customer are both 10 minutes. The charging time for the drone is 20 minutes.

Figure 1: Pre-decision and post-decision states

The panel on the left shows the status of the vehicle and drone before the dispatcher makes the acceptance and assignment decisions regarding . The vehicle is currently en route serving and then . Its planned route is . It will arrive at at , and then arrive at at . The vehicle then will leave at and return to the depot at . Thus, the arrival time of the first depot entry in the planned route is 220. The vehicle is planned to serve in its next route. Customer is accepted but not yet loaded because the vehicle has not returned to the depot to load the package for it. The vehicle plans to arrive at at and then return to the depot at . The drone is currently en route serving , with the planned route . Note, we round up the arrival and return times to integers for the drone. The drone will arrive at at , and return to the depot at . Due to the charging time, the drone will not load the new package until . It is then planned to leave the depot for and arrive at at . The drone will return to the depot at .

As the new customer makes a request, the dispatcher determines the feasibility of assigning the new customer to the vehicle and the drone. The delivery deadline of is . If the vehicle serves the new right after , then it will arrive at at . It does not satisfy the deadline of so this insertion is not feasible. Alternatively, if the vehicle serves first and then , it will arrive at at and then arrive at at . It meets the deadline of but violates that of . Because the SDDPHF does not allow rejections of requests once they are accepted, the alternative insertion is not feasible either. Overall, it is not feasible to serve by the vehicle. As for the drone, if it serves after it returns to the depot from the planned , then it will arrive at at . This is a feasible assignment because it satisfies the deadlines of and . Alternatively, if it serves first and then , then it will arrive at at and at . This alternative routing does not satisfy the deadline of so it is not feasible. Thus, it is feasible to serve with the drone because there exists a feasible route.
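The feasibility reasoning in this example amounts to trying each position at which the new customer could be inserted and keeping only the sequences in which every customer, old and new, is still served on time. A minimal sketch of that enumeration follows. The toy timing check (customers on a line, depot at position 0, unit speed, no service time) is an assumption chosen for brevity and does not reproduce the numbers in Figure 1.

```python
def insertion_candidates(route, new_customer):
    """All sequences obtained by inserting new_customer at one position."""
    return [route[:i] + [new_customer] + route[i:]
            for i in range(len(route) + 1)]


def on_time(route, start_time=0.0):
    """Toy check: customers are (position, deadline) pairs on a line,
    the depot is at position 0, and travel takes one time unit per unit
    of distance. Returns True if every deadline is met."""
    clock, position = start_time, 0.0
    for location, deadline in route:
        clock += abs(location - position)
        if clock > deadline:
            return False
        position = location
    return True


def feasible_insertions(route, new_customer, check=on_time):
    """Keep only the insertions that leave all customers on time."""
    return [c for c in insertion_candidates(route, new_customer) if check(c)]


planned = [(2.0, 5.0)]   # one accepted customer at position 2, deadline 5
request = (1.0, 1.0)     # new customer at position 1, deadline 1
options = feasible_insertions(planned, request)
```

Here only the sequence that visits the new customer first is feasible. If the returned list were empty, serving the request with this vehicle or drone would be infeasible.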

The dispatcher next makes the decision. Let us assume the dispatcher accepts the request and assigns it to the drone. Then, the update is shown in the panel on the right in Figure 1. The vehicle’s planned route remains the same, and the drone’s planned route becomes .

4 Markov Decision Process Model

In this section, we model the SDDPHF as a Markov decision process (MDP). An MDP models a stochastic and dynamic problem as a sequence of states connected by actions and transitions. Due to similarities in the problems, this MDP model is similar to that found in drones.

Decision point. A decision point is a time at which a decision is made. In the SDDPHF, a decision point occurs when a customer requests service. We denote the customer request as the time of the decision point as .

State. The state at a decision point summarizes the information needed to make the decision. In the SDDPHF, the state at decision point includes time of the decision point, the customer request, and all the planned routes. Thus, we represent the state as a tuple , time of the decision point , the location of customer , and the set of planned routes . In the initial state , represents a vehicle’s (or drone’s) initial position at the depot and is because every vehicle (drone) is available once the shift begins.

Actions. In the SDDPHF, an action incorporates whether the request is accepted and, if so, which vehicle or drone will provide the service. We represent the action at a decision point as a tuple , where is the acceptance and assignment decision, and () is the updated set of vehicle (drone) planned routes given . In addition, () represents the updated set of customers planned to be serviced by vehicles (drones) but not yet loaded.

Before making the acceptance and assignment decision, the dispatcher determines the feasibility of serving a request by vehicles (drones). It is feasible to serve customer by vehicles if there exists an update satisfying the following six conditions:

1. The planned routes in contain all the customers in .

2. For every customer , the planned arrival time is not later than the deadline .

3. In each planned route , the start of loading at the depot is not earlier than the arrival time .

4. In each planned route, the difference between the beginning of loading for the next tour at the depot and the arrival time at the next customer is the sum of travel time and loading time.

5. In each planned route, the difference between the arrival times of two consecutive customers is equal to the sum of travel time and service time.

6. The vehicles must arrive at the depot before the end of the shift, .

Similarly, it is feasible to serve by drones if there exists an update satisfying the following six conditions:

1. The planned routes in contain all the customers in .

2. For every customer , the planned arrival time is not later than the deadline .

3. In each planned route , the start of loading at the depot is not earlier than the arrival time .

4. In each planned route, the difference between the beginning of loading for the next tour at the depot and the arrival time at the next customer is the sum of travel time and loading time.

5. In each planned route, every customer entry must lie between two depot visits.

6. In each planned route, the arrival time of the last depot visit is not later than the end of the shift, .

Note, when the dispatcher determines the feasibility for vehicles (drones), there can be more than one feasible update from which to choose. We will discuss the heuristics that we use to decide the update () in Section 5.3.

Given the feasibility of serving a customer with a vehicle or a drone, the dispatcher then makes the acceptance and assignment decision , which is defined as:

Infeasibility automatically leads to denial, while feasibility does not guarantee acceptance. For example, if it is not feasible to serve a customer by any vehicle or drone, then the dispatcher automatically denies service. If it is feasible to serve a customer by vehicles and infeasible by drones, then the dispatcher will not consider any drone to provide the service. The dispatcher can still decide not to offer the service to the customer even if vehicles are available.
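The relation between feasibility and the decisions open to the dispatcher can be sketched as a small function. The string labels "reject", "vehicle", and "drone" are hypothetical stand-ins for the paper's notation, which is not reproduced here.

```python
def available_decisions(vehicle_feasible, drone_feasible):
    """Return the decisions the dispatcher may choose from.

    Denying service is always allowed. A fleet type is offered only if
    serving the request with that fleet type is feasible.
    """
    options = ["reject"]
    if vehicle_feasible:
        options.append("vehicle")
    if drone_feasible:
        options.append("drone")
    return options


print(available_decisions(vehicle_feasible=True, drone_feasible=False))
```

When neither fleet type is feasible, the only remaining option is to deny service, which matches the automatic denial described above.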

Reward. Given the state at decision point , the reward of an action is

Transitions. There are two types of transitions involved in the SDDPHF. The first type is from pre-decision to post-decision states and is determined by the acceptance and assignment decision . The second type is from post-decision to the next pre-decision states and is determined by exogenous information.

After is taken at , pre-decision values , , and are updated to post-decision values , , and to reflect the effect of the decision. The update works as follows:

  • If the request is not accepted, , , and , and is ignored thereafter.

  • If the customer is assigned to a vehicle, , and . Then, is updated to the selected that is obtained when the dispatcher determines feasibility.

  • If the customer is assigned to a drone, , and . Then, is updated to the selected that is obtained when the dispatcher pre-calculates the feasibilities.

After the post-decision values are updated, the vehicles and drones proceed with their planned routes until a new customer request is received at . When the new request is revealed at , the transition from the post-decision state to the next pre-decision state takes place as one of the following situations:

  1. If there are no vehicles or drones returning to the depot between and , the planned routes stay the same as in the previous post-decision state, .

  2. If a vehicle (or drone ) returned to the depot between and and already started the next tour, then the customers placed in the ongoing tour are removed from (or ). The resulting (or ) contains the information on the next depot return only.

  3. If a vehicle (or drone) returned to the depot between and and is currently waiting at the depot (), then is set equal to .

  4. If a vehicle (or drone ) finished servicing all the customers in both ongoing and planned routes between and , then the route is set to .

The MDP terminates in state with all the planned vehicle and drone routes being .

Objective. A solution to the SDDPHF is a policy that assigns an action to each state. The optimal solution is a policy that maximizes the total expected reward and can be expressed by

5 Solution Approach

In this section, we present our solution approach. It is well known that the solution to an instance of an MDP model can theoretically be found using backward induction applied to the Bellman equation:


where is the set of actions available at state .
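For orientation, the Bellman equation of a finite-horizon MDP with reward maximization generally takes the following form. The notation here is an assumption chosen for illustration ($S_k$ for the pre-decision state at decision point $k$, $x$ for an action, $\mathcal{X}(S_k)$ for the available actions, $R$ for the reward, $V$ for the value function) and may differ from the symbols used in the paper.

```latex
V(S_k) = \max_{x \in \mathcal{X}(S_k)}
  \Big\{ R(S_k, x) + \mathbb{E}\big[\, V(S_{k+1}) \mid S_k, x \,\big] \Big\}
```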

As with many other dynamic routing problems, however, the SDDPHF suffers from the “curses of dimensionality,” most often a state space too large to even enumerate. In this paper, we also encounter a very large action space. Notably, the dispatcher must determine not only whether to offer service, but also to which fleet to assign accepted customers and how to route any customers assigned to vehicles. Thus, we propose a heuristic. In the following, we give a conceptual overview and an outline of our heuristic. We then describe the components in detail.

5.1 Motivation

For the SDDPHF, decisions should be fast and effective. Decisions should be made fast because customers expect immediate feedback about their service requests. They should be effective by accounting for immediate and expected future revenue. To satisfy both requirements, we draw on methods of reinforcement learning (sutton2018reinforcement). The idea is to learn the value of a state and decision by means of simulation. This simulation is done “offline” in a learning phase. The learned values can then be accessed within the “online” execution without any additional calculation time required. Because the size of the action space would prohibit even offline learning, we also draw on a runtime-efficient routing heuristic, reducing the action space to . The reduced action space does not require a routing decision but only the choice of whether to provide service and by what fleet.

Learning values is challenging because of the enormous sizes of the state and decision spaces. For the SDDPHF, multiple vehicles and drones are routed to serve many customers. Storing the value for every potential state and action combination not only leads to substantial memory consumption, it also makes frequent value observations, and therefore learning, impossible. Thus, we reduce the state space to a set of selected features and approximate the values for each feature vector by means of deep Q-learning.

Figure 2 presents a flow chart of the process. In a pre-decision state, we first check feasibility for drones and vehicles by means of the routing heuristic. If serving the customer is infeasible for both fleets, no service is offered. Otherwise, depending on the drone and vehicle feasibility, we extract state features and evaluate the state and corresponding routing decision provided by the heuristic. Based on the evaluation, we assign the customer to a drone or vehicle, or we do not offer service to the customer at all.

Figure 2: Conceptual Process of Decision Making.

Two main challenges arise from the process: what features to extract and how to evaluate the value of a specific state-decision pair. For the first, we will present a selection of features based on analytical considerations in Section 6. For the latter, we draw on Q-learning as discussed in the following. We denote our policy .

5.2 Deep Q-learning

In this section, we introduce the deep Q-learning solution approach and structure of the NN used for the SDDPHF. Q-learning is a reinforcement learning approach that learns the value of taking an action in a given state. Thus, Q-learning learns a value for each state and action , and this value is an approximation of making decision in state . Given that we operate on the restricted action space , we learn a value for each state and action . With these Q-values and the reduced action space, we can solve an approximation of Equation 1 that can be written as


Depending on the result of the feasibility check, we have three potential decision sets. One set contains three decisions (drone, vehicle, no service), and two sets contain two decisions (vehicle/drone, no service). For each of the potential sets, we create an NN to approximate the value of the corresponding decisions. This set of networks is denoted as , where each represents the set of weights or parameters in the corresponding NN. In the following, we present the structure of the NNs and the procedure to train them.
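The idea of keeping one network per decision set and picking the action with the largest Q-value can be sketched as follows. The action labels, the `DECISION_SETS` mapping, and the `choose_action` helper are hypothetical names; each "network" is modeled as any callable that maps a feature list to one Q-value per action in its decision set.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical action labels for the reduced action space.
NO_SERVICE, VEHICLE, DRONE = "no_service", "vehicle", "drone"

# One decision set per feasibility outcome (vehicle_feasible, drone_feasible).
DECISION_SETS: Dict[Tuple[bool, bool], List[str]] = {
    (True, True): [VEHICLE, DRONE, NO_SERVICE],
    (True, False): [VEHICLE, NO_SERVICE],
    (False, True): [DRONE, NO_SERVICE],
}

QNetwork = Callable[[List[float]], List[float]]


def choose_action(
    features: List[float],
    vehicle_feasible: bool,
    drone_feasible: bool,
    networks: Dict[Tuple[bool, bool], QNetwork],
) -> str:
    """Pick the action with the largest Q-value in the applicable decision set."""
    key = (vehicle_feasible, drone_feasible)
    if key not in DECISION_SETS:
        # Neither fleet can serve the customer: service is denied automatically.
        return NO_SERVICE
    actions = DECISION_SETS[key]
    q_values = networks[key](features)
    best = max(range(len(actions)), key=lambda i: q_values[i])
    return actions[best]
```

Using a separate approximator per decision set means that no output node ever corresponds to an infeasible action, so no masking is needed at selection time.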


An NN is characterized by its input layer, hidden layers, and output layer. In our approach, each of the three NNs has the same structure:

Input layer. The input layer receives the features extracted from the state and passes them to the hidden layers. Each of the three NNs uses the same features, and we present the features in Section 6.

Hidden layers. Each of the three NNs has hidden layers and nodes in each hidden layer. We use the rectified linear unit function (ReLU) as the activation function for each hidden layer.

Output layer. The output layer of the NNs in the SDDPHF outputs the Q-values for each possible action for a state . Because the networks approximate real-valued future rewards, there is no activation function on the output layer.

To determine the best model for the NNs, we test different numbers of hidden layers for each NN, , and different numbers of hidden nodes, , for each layer. The computational results show that the combination of and outperforms the others.
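A network of the kind described above (ReLU hidden layers, linear output with one Q-value per action) can be sketched with the standard library alone. The layer sizes, the uniform weight initialization, and all function names below are assumptions made for illustration; they are not the configuration used in the paper.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def relu(values: List[float]) -> List[float]:
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Affine map; weights[j] holds the incoming weights of output node j."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def build_network(n_features: int, n_hidden_layers: int, n_nodes: int,
                  n_actions: int, seed: int = 0) -> List[Layer]:
    """Create randomly initialized layers (the initialization is an assumption)."""
    rng = random.Random(seed)
    sizes = [n_features] + [n_nodes] * n_hidden_layers + [n_actions]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def q_values(network: List[Layer], features: List[float]) -> List[float]:
    """Forward pass: ReLU on hidden layers, no activation on the output layer."""
    x = list(features)
    for weights, bias in network[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = network[-1]
    return dense(x, weights, bias)
```

For example, `q_values(build_network(4, 2, 8, 3), [0.1, 0.2, 0.3, 0.4])` returns three real-valued estimates, one per action. Leaving the output layer without an activation keeps the estimates unbounded, as needed for real-valued future rewards.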


We learn the parameters of the NNs in a training phase. Training is performed by sampling with replacement from a set of sample paths for each instance. Each sample path represents a simulated day. Within the simulation, we make decisions using the policy obtained from the current NNs while also occasionally making random decisions to allow for further exploration of the state space. We provide more details of the exploration later in the section. Each sample path is a training step of the NNs. Thus, we update the NNs after each sample path with a batch of the new observations and previous observations, a practice known as experience replay. We provide further details later in this section. We use the mean-squared error as the loss function and minimize it using the well-known Adam optimizer (adam), a stochastic gradient-based algorithm. As a result of experimentation, for weight updates, we use a learning rate that exponentially decays from with the base and the decay rate .

Figure 3: Comparison of solution quality curves with and without experience replay

Experience replay. In training the NNs, we implement experience replay (lin). The goal of experience replay is to overcome the correlations between successive states in an episode (a day in the SDDPHF) and the similarities between different episodes. In the SDDPHF, an experience tuple is a state-action-reward tuple . In our training, we create an experience buffer for each NN that stores such tuples. For each NN, we randomly sample a mini-batch of tuples from the experience buffer for each update of the weights.
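A minimal experience buffer along these lines might look as follows. The fixed capacity with oldest-first eviction is an assumption (the text does not say how old tuples are discarded), and the class and method names are hypothetical.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state features, action label, reward): a state-action-reward tuple.
Experience = Tuple[Tuple[float, ...], str, float]


class ReplayBuffer:
    """Stores experience tuples and returns uniformly sampled mini-batches."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # Assumption: once the buffer is full, the oldest tuple is dropped.
        self._buffer: Deque[Experience] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, experience: Experience) -> None:
        self._buffer.append(experience)

    def sample(self, batch_size: int) -> List[Experience]:
        """Draw a mini-batch without replacement, capped at the buffer size."""
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self) -> int:
        return len(self._buffer)
```

In a training loop, one such buffer would be kept per NN, filled after each simulated day, and sampled before each weight update. Sampling at random breaks up the ordering of consecutive decision points within a day.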

In Figure 3, we present a comparison of the learning curves with and without experience replay for an instance later presented in the computational study ( homogeneously distributed customers, vehicles, drones). In each plot, we present the solution quality curve and the best solution found in the first training steps. Overall, in the same training steps, the solution quality of the DQL with experience replay is () served customers more than that without experience replay. Without experience replay, the algorithm barely learns in the first steps. The variance of the curve is not reduced until after about steps. However, with experience replay, the DQL quickly learns and shows a promising convergence and reduced variance as the training progresses.

Exploration and exploitation. During training, we usually select the decision that maximizes the total expected reward associated with a state. This is known as exploitation. However, it is well known that occasionally selecting a decision at random improves the quality of the approximation (powell2011approximate, Ch. 12). These random selections are known as exploration. We set the probability of exploring and exploiting , where decays from to over the training steps.
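The exploration scheme can be sketched as an epsilon-greedy rule with a decaying exploration probability. The linear decay schedule and the function names are assumptions; the text only states that the probability decays between two values over the training steps.

```python
import random
from typing import List


def epsilon_at(step: int, total_steps: int, eps_start: float, eps_end: float) -> float:
    """Exploration probability at a training step (linear decay is an assumption)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values: List[float], epsilon: float, rng: random.Random) -> int:
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploitation
```

With `epsilon` set to zero the rule is purely greedy, which corresponds to how the learned policy would be executed after training.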

5.3 Routing and Assignments

As discussed at the start of the section, to overcome the large action space associated with the SDDPHF, we heuristically route customers and heuristically assign them to drones. We assign customers to drones in a first-in-first-out (FIFO) manner. We prioritize drones idling at the depot and assign a new customer to an idling drone prior to one en route. If all the drones are en route, we assign a new customer to the drone with the earliest availability. We arbitrarily choose a drone if there is a tie. It is not feasible to service a customer by drone if no drones can deliver the goods by the customer’s delivery deadline.
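The drone assignment rule can be sketched as follows. This is a simplified illustration: each drone is represented only by the time at which it is next available at the depot (assumed to already include any charging), and the parameter names are hypothetical.

```python
from typing import List, Optional


def assign_drone(
    now: float,
    deadline: float,
    flight_time: float,
    loading_time: float,
    available_at: List[float],
) -> Optional[int]:
    """Return the index of the drone to use, or None if drone service is infeasible.

    Idle drones (available_at <= now) are all ready at `now`, so they are
    preferred over drones en route; ties go to the lowest index.
    """
    if not available_at:
        return None
    best = min(range(len(available_at)), key=lambda i: max(available_at[i], now))
    start = max(available_at[best], now)
    arrival = start + loading_time + flight_time
    return best if arrival <= deadline else None
```

Because the earliest-available drone is also the one that can reach the customer soonest, checking the deadline for that drone alone is sufficient in this simplified setting.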

To route customers on vehicles, we implement an insertion heuristic, which in this case is an extension of the heuristic presented in azi. Our heuristic works as follows. If there is a vehicle currently idling at the depot, we assign the new customer to it. If no vehicles idle at the depot, we go through all possible insertions in each vehicle’s planned route. An insertion (update) is feasible if it satisfies the six conditions described in Section 4. For every feasible insertion, we then calculate the increase in the tour time of the vehicle into which the new customer is inserted, denoted Vehicle. We assign the new customer to the vehicle with the insertion that minimizes Vehicle. If there are no feasible insertions for any vehicle, then it is not feasible to serve the new customer with the fleet of vehicles.
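The core of such an insertion heuristic can be sketched as a cheapest-insertion search. This is a simplified illustration, not the authors' implementation: routes are lists of planar points that start and end at the depot, travel time is Euclidean distance divided by a speed, service and loading times are ignored, and the feasibility conditions are delegated to a hypothetical `is_feasible` callback.

```python
import math
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]
Route = List[Point]  # begins and ends with the depot


def travel_time(a: Point, b: Point, speed: float = 1.0) -> float:
    return math.dist(a, b) / speed


def insertion_cost(route: Route, pos: int, customer: Point) -> float:
    """Increase in tour time when `customer` is inserted before route[pos]."""
    prev, nxt = route[pos - 1], route[pos]
    return (travel_time(prev, customer) + travel_time(customer, nxt)
            - travel_time(prev, nxt))


def cheapest_insertion(
    routes: List[Route],
    customer: Point,
    is_feasible: Callable[[int, int], bool] = lambda vehicle, pos: True,
) -> Optional[Tuple[float, int, int]]:
    """Return (cost, vehicle index, position) of the cheapest feasible insertion.

    Returns None when no insertion is feasible, in which case the customer
    cannot be served by the vehicle fleet.
    """
    best: Optional[Tuple[float, int, int]] = None
    for vehicle, route in enumerate(routes):
        for pos in range(1, len(route)):
            if not is_feasible(vehicle, pos):
                continue
            cost = insertion_cost(route, pos, customer)
            if best is None or cost < best[0]:
                best = (cost, vehicle, pos)
    return best
```

The returned cost plays the role of the insertion cost that is later used as a feature. A real feasibility check would verify deadlines and the end of the shift for every customer after the insertion point.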

6 Features

In this section, we present the features in the DQL for the SDDPHF. We first motivate the choice of features by analytically examining the functional form of the Bellman equation as well as several situations in which we can characterize optimal action selection. We then use these results to identify features appropriate for this problem.

6.1 Analytical Results

In this section, we present a series of analytical results for the SDDPHF. The proofs are presented in Appendix A.1. We begin by characterizing Equation (1). Our goal is to identify how the value function reacts to changes in the state. First, we show that Equation (1) is monotonically decreasing in time, the return time of the vehicles, and the return time of the drones. The results are formalized in Propositions 1, 2, and 3.

Proposition 1.

In the SDDPHF, let represent time of the decision point in the state. Then, the expected reward is monotonically decreasing in .

Proposition 2.

Suppose, in the SDDPHF, the fleet has one vehicle. Let represent the time at which the vehicle returns to the depot from the current route. Then, the expected reward is monotonically decreasing in .

Proposition 3.

Suppose, in the SDDPHF, the fleet has one drone. Let represent the time at which the drone returns to the depot from the current route. Then, the expected reward is monotonically decreasing in .

We next seek to characterize circumstances in which we can identify optimal actions. Our goal is to identify information that is needed to determine these optimal actions and to make sure that this information is available as a feature. Importantly, we analytically demonstrate the value that not offering service to a particular customer can have on the objective. Appendix A.3 offers a computational demonstration.

Because the structure of the SDDPHF is complex, we simplify the problem for this analysis. We assume that customers dynamically make delivery requests over . Customer requests are revealed following a Poisson process with rate . Each requesting customer’s distance from the depot follows a uniform distribution with the support , where is the maximum possible distance. The distance between a customer and the depot is converted to the corresponding vehicle’s travel time. In the remainder of this section, we use the term “distance” synonymously with travel time of the corresponding fleet type. The arrival times of requests are independent, and the random variables, time , and distance are also independent. The fleet consists of a vehicle and a drone. We omit the loading and charging times for the fleet. The capacity of the vehicle is unlimited, and the capacity of the drone is 1. We assume that the drone travels () times as fast as the vehicle.

We first consider the situation at the end of the day. Proposition 4 identifies the circumstances in which a drone idling at the depot can serve at least one more customer before the end of the day.

Proposition 4.

Suppose a drone is available when a new customer that is (vehicle travel time) units from the depot requests service at time . Then, we can serve this customer if .

We now determine whether we should serve an end-of-the-day customer request with an idling drone. Assume the vehicle is in its last route and will return to the depot at . The drone can feasibly serve the customer at time as given in Proposition 4. To determine whether to serve this request at , we want to know how many requests the drone is expected to serve during . Importantly, it is possible that, instead of serving the current request, the drone could serve several customers whose requests arrive after but that are close to the depot.

To address this question, we calculate the probability that the drone can serve at least one more customer after serving . We present the probability in Lemma 1.

Lemma 1.

Suppose that the drone is available when a customer request located (vehicle travel time) units from the depot is revealed at time . If the drone is dispatched to serve the customer, then the probability that the drone can serve at least one more customer after returning to the depot is


Instead of serving the current customer request arriving at time , the dispatcher can choose not to offer service to the customer. We would make such a decision if by doing so we would expect to serve more customers thereafter. The probability that the drone can serve at least one customer is given in Lemma 2.

Lemma 2.

The drone is available when a customer request units from the depot is revealed at . If the dispatcher does not offer service to the new customer, then the probability that the drone can instead serve more than 1 customer thereafter is


Using Lemmas 1 and 2, Proposition 5 identifies when the dispatcher should not accept a feasible request.

Proposition 5.

Assume that the drone is available when a customer request that is units from the depot is revealed at . Let be the travel time that equates the probabilities in Lemmas 1 and 2. The dispatcher should always accept and assign the feasible request to the drone if , and not accept it if .

For three values of , Figure 4 illustrates Proposition 5 for parameter settings , , , and . The horizontal axis represents the possible distance of the new feasible customer, and the vertical axis is the probability. The probability in Lemma 1 (black dots in Figure 4) can be seen as that of achieving immediate reward and at least future reward. The probability in Lemma 2 (blue dots) can be seen as that of achieving at least future rewards.

Figure 4: The probabilities vs. the distance of the customer request

As shown in the left-hand figure, given these parameter settings, there is a time before the horizon at which we always accept the customer. When it is long enough before the end of the horizon, is much greater than so the two probabilities do not intersect in the plot. The probability that the drone can serve at least one more customer after serving the current one is always higher than the other one. In this case, the dispatcher should always take the immediate reward. As it is closer to the end of the horizon, , the probability of achieving a future reward of goes down. The two probabilities intersect at about . In this situation, the dispatcher should accept the feasible request if is between and approximately . Otherwise, the dispatcher should consider not accepting the request, because, as the distance becomes larger, the chance that we cannot serve any more customers thereafter grows. At , when it is even closer to the end of the horizon, both probabilities are relatively low. In this situation, we observe that the value of is smaller than that for . As shown in this illustrative example, the threshold concerning denial decisions should incorporate the time in a shift.

In the case that the drone is unavailable for the rest of the day and the vehicle is idling at the depot, we can derive results analogous to Proposition 5. In the case of the vehicle, though, we need to consider the cost of inserting a customer in the vehicle’s route rather than just the travel time from the depot. We denote this cost as , the increase in the vehicle’s tour time that results from inserting the customer into the vehicle’s planned route. Due to their different capacities, the times at which the drone and the vehicle are planned to return to the depot are determined by different quantities. The drone has a capacity of one, so it must return to the depot after every delivery. The vehicle can serve multiple customers in a route. If the dispatcher assigns a new request to the vehicle, the time at which it is planned to return is postponed by , the insertion cost of this new request. We wanted to develop a proposition for the vehicle similar to Proposition 5. However, due to the unlimited capacity of the vehicle, we expect it is far more complicated and thus not practical to do so. Because of the inter-dependencies of decisions, we were not able to provide a straightforward proof.

Although the SDDPHF is far more complicated than the simplified version, we can still use similar logic for the analysis of the SDDPHF. For example, when the delivery resources become limited, the dispatcher should consider offering no service to some feasible requests in exchange for a higher expected reward in the future. We take the logic in this analysis as motivation to select the features for the SDDPHF.

6.2 Features

Using the analysis in the previous section, we can identify information that needs to be extracted from the state that is to be input into the NNs. The features comprise information about the time, the customers, and the fleet. Our DQL approach uses all the features presented below. Each of the features extracted from the state is normalized using the min-max normalization before being input to the NNs.
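Min-max normalization maps a raw feature value onto the unit interval given lower and upper bounds. A minimal sketch follows; the bounds for each feature (for example, the start and end of the shift for time features) are assumptions that the caller would supply.

```python
def min_max_normalize(value: float, lower: float, upper: float) -> float:
    """Scale `value` linearly so that `lower` maps to 0 and `upper` maps to 1."""
    if upper == lower:
        # Degenerate range: return 0 rather than divide by zero.
        return 0.0
    return (value - lower) / (upper - lower)
```

Putting all features on a common scale keeps inputs with large raw magnitudes, such as times measured in minutes, from dominating the weight updates.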

Time. Given a state , the first feature is the time of the decision point . Proposition 1 shows that the value function is monotonically decreasing in time and the point of time of a decision point should help determine the Q-value. Computational results presented in papers such as marlin have also demonstrated the value of the point in time for same-day delivery problems.

Fleet. To reflect the availability of resources in the state at a decision point, we include as features the time at which the vehicles and drones return to the depot from their ongoing routes. As shown in Propositions 2 and 3, we know that the value function is monotonically decreasing in these values. Suppose the fleet consists of vehicles and drones . Then, the available time for vehicle is indicated in its planned route as described in Section 3.2.1. Similarly, for drone , its available time is indicated in .

Actions. We also include two features that capture the resource consumption that would occur if the customer request is accepted. These features also incorporate information about the requesting customer. The first feature is the distance between and the depot . Proposition 5 shows that this distance plays an important role in determining whether or not to assign a customer to a drone. However, the feature is not suitable for making decisions concerning the assignments to vehicles. To this end, we also consider a second feature Vehicle, as introduced in Section 6.1. As discussed, this value plays an important role in determining whether or not a customer should be served by a vehicle.

As an example of feature extraction, again consider the illustrative example described in Section 3.3. A decision point is triggered by customer making a request. The features in the pre-decision state are extracted. In this pre-decision state, time of the decision point is . The distance of , converted to the corresponding drone travel time, is minutes. The is a large constant (e.g., 10000) because it is not feasible to serve by the vehicle. The vehicle’s available time is , and the drone’s available time is . Thus, before normalization, the features can be summarized as a tuple .
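Assembling the feature tuple can be sketched as follows. The ordering of the entries and the function name are assumptions; the large constant for an infeasible vehicle insertion follows the example value of 10000 given in the text.

```python
from typing import Optional, Sequence, Tuple

INFEASIBLE_INSERTION_COST = 10000.0  # large constant used when no insertion is feasible


def extract_features(
    time: float,
    drone_travel_time: float,
    insertion_cost: Optional[float],
    vehicle_available: Sequence[float],
    drone_available: Sequence[float],
) -> Tuple[float, ...]:
    """Build the raw (unnormalized) feature tuple for one pre-decision state.

    `insertion_cost` is None when the customer cannot be served by a vehicle.
    """
    cost = INFEASIBLE_INSERTION_COST if insertion_cost is None else insertion_cost
    return (time, drone_travel_time, cost, *vehicle_available, *drone_available)
```

With one vehicle and one drone, as in the illustrative example, the tuple has five entries. With larger fleets it grows by one entry per additional vehicle or drone.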

6.3 Impact of Feature Selection

To demonstrate the value of our feature selection, we will briefly illustrate how our policy performs for different subsets of features. We use the same instance setting as before.

Figure 5: Solution quality curves with different features vs. DQL (3 vehicles 10 drones)

We show four different feature sets with additional sets shown in Appendix A.2. All sets contain the point of time of the current decision. The first set, whose results are shown in the top left, contains the features proposed in this paper, the DQL-features reflecting the state and action spaces. The action space is represented by the distance to indicate the travel time when sending a drone and the additional travel time when assigning a vehicle. The state space is reflected by availability times for all drones and vehicles. The second subset, whose results are shown on the top right, contains similar features, but focuses on the directly affected vehicle and drone, ignoring the state of the rest of the fleets. This allows a comparison with the approach of chen2019, who do not consider fleet interactions. The third set, whose results are shown on the bottom left, contains features that reflect only the action space of the problem (distance, additional travel time). This set therefore ignores the state information about the utilization of the fleets.

Q-learning considers the value of a state-action pair. As a fourth set of features, we consider a different approach to VFA. We use the post-decision state, and the results are shown on the bottom right. The post-decision state represents the state immediately following action selection, but before the realization of new exogenous information. In this case, the post-decision state includes the point of time as well as the availability times of drones and vehicles that result from an assignment or a rejection of a request. We evaluate the ability to learn the value of post-decision states because post-decision state VFA is common in the literature (see marlin) and particularly because post-decision state VFA is used in drones, a paper that also looks at same-day delivery but does not consider a heterogeneous fleet.

In each plot, we plot the solution quality curve for the first 400,000 training steps. The horizontal axis represents the number of training steps, and the vertical axis represents the solution quality. The horizontal line represents the best found solution value.

We observe that the proposed feature selection outperforms all other selections. Sets two and three show substantially worse solution quality and limited learning. The fourth set shows a constant, but slow learning process. Even after 400,000 training runs, the approximation has not converged. We note that the VFA can access the action space features distance and travel detour implicitly when comparing the different assignment decisions. However, using the features explicitly as input in the DQL, as we do, leads to more effective approximation and better solution quality, at least for this number of training steps. The action space features are likely to guide decision making, especially in early training runs when the approximation is still weak. This indicates that enriching a value function approximation with action space features may be beneficial for large-scale problems such as those often observed in dynamic vehicle routing.

7 Computational Study

In this section, we present the computational study for the SDDPHF. We first present the instance settings for the computations. We then describe the benchmark policies. We compare the policies for a variety of instance settings and finally analyze decision making in detail.

7.1 Instance Settings

Due to the similarities of the problems, we work on the instances provided by drones. These instances assume delivery requests are made from am to pm and that vehicle drivers work eight hours from am to pm, thus hours. The drones are available from am to pm, thus hours. We assume that accepted requests must be serviced within hours. The loading and service times for both vehicles and drones are each minutes ( minutes). The battery charging time required for drones is set to minutes, which is a conservative estimate based on discussion in
The vehicles travel at a speed of km/h. We assume vehicles travel on a road network. To reflect the effect of road distances and traffic, we transform Euclidean distances using the method introduced by boscoe. The method transforms Euclidean into an approximate street-network distance by multiplying the Euclidean distance between two points by . We compute the travel time based on these transformed distances.
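The conversion from Euclidean distance to vehicle travel time can be sketched as below. The function and parameter names are hypothetical, and the detour factor and speed are left as arguments for the caller to supply.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # coordinates in kilometers


def vehicle_travel_time_minutes(
    origin: Point, destination: Point, detour_factor: float, speed_kmh: float
) -> float:
    """Travel time in minutes over an approximate street-network distance."""
    euclidean_km = math.dist(origin, destination)
    road_km = detour_factor * euclidean_km  # approximate street-network distance
    return road_km / speed_kmh * 60.0
```

For a point-to-point flight, the same function could be reused with a detour factor of 1 and the drone speed.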

We assume drones travel in a point-to-point fashion at a speed of km/h. This speed is a conservative estimate of drone capability accounting for security measures within the city (pero). As described in Section 3.1, we assume drones are capable of delivering all packages in the SDDPHF.

We assume that customers make delivery requests according to a homogeneous Poisson process with requests expected in each day. For customer locations, we use two geographies. In the first geography, the and coordinates of customer locations are generated from independent and identical normal distributions with the depot at the center. This geography reflects the structure of many cities in Europe where most customers are in the central area and the rest are sparsely located in the suburbs. We set the standard deviation to km for each coordinate resulting in % of the customers being in a core that is within 10-minute vehicle travel time from the depot (approximately km) and about % of the customers within minutes (approximately km). This distance is within the travel capability of existing drones, which can carry payloads weighing up to pounds and travel up to miles (km) (ups_drone). Thus, in the SDDPHF, we assume drones are capable of flying to any customer and then returning to the depot without charging en route.

The second customer geography is heterogeneous over time. This geography is motivated by the idea that, in the beginning of the day, customers often order to their homes, in the middle of the day, more customers order to work, and later, more customers order to their homes again. Thus, we vary the standard deviation of the customer coordinates. In the first and last two hours, it is km as before. For the three hours in between, it is reduced to km.
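A request generator for both geographies can be sketched as follows. The depot is placed at the origin, request times come from a homogeneous Poisson process via exponential inter-arrival times, and the coordinate standard deviation is supplied as a function of time so that it can vary during the day. All names and any numeric values in the usage note are assumptions for illustration.

```python
import random
from typing import Callable, List, Tuple

Request = Tuple[float, float, float]  # (request time in minutes, x in km, y in km)


def generate_requests(
    expected_requests: float,
    horizon_minutes: float,
    sigma_km_at: Callable[[float], float],
    seed: int = 0,
) -> List[Request]:
    """Simulate one day of requests with normally distributed coordinates."""
    rng = random.Random(seed)
    rate = expected_requests / horizon_minutes  # expected requests per minute
    requests: List[Request] = []
    t = 0.0
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival time
        if t > horizon_minutes:
            break
        sigma = sigma_km_at(t)
        requests.append((t, rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)))
    return requests
```

The first geography corresponds to a constant function such as `lambda t: 3.0`. The second corresponds to a piecewise function such as `lambda t: 3.0 if t < 120 or t >= 300 else 1.5`, where the values 3.0 and 1.5 are placeholders rather than the paper's settings.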

For the two geographies, we test nine different combinations of fleet sizes. We consider combinations of 2, 3, and 4 vehicles with 5, 10, and 15 drones.

7.2 Benchmarks

In this section, we introduce the policies that we use to benchmark the performance of our DQL approach. In the main body of the paper, we present comparisons to only three benchmarks. Additional benchmarks are introduced and analyzed in the Appendix. We first briefly review the PFA approach introduced in drones and then introduce two additional PFA variants. We do not address the other policies covered in drones because the PFA dominates them. Particularly, the PFA was shown to dominate policies that sought to prefer vehicles in assignments and to prefer drones in assignments.

As discussed in Section 2.1, drones introduce a PFA approach to solve the SDDPHF. The authors consider a policy that incorporates the intuition that drones are suitable to serve the customers that are farther from the depot and vehicles serve those that are closer. The PFA policy is parameterized by a vehicle travel time threshold . That is, the only feature that uses is . The policy works as follows. When customer makes a delivery request at , the dispatcher checks if it is feasible to serve the customer by either a vehicle or drone and then makes a decision regarding the acceptance and assignment. Customer is automatically not offered service if no vehicles or drones are able to make the delivery. When there is only one fleet that can complete the service before the customer’s delivery deadline, the customer is assigned to that fleet. When both fleets can feasibly serve customer , the vehicle travel time between and the depot is compared to the threshold . Request is assigned to a vehicle if it is within the threshold and to a drone otherwise. Feasible requests are always accepted.
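The threshold rule described above can be sketched as a small function. The action labels and the function name are hypothetical, and the feasibility flags are assumed to come from the routing and assignment heuristics.

```python
def pfa_decision(
    vehicle_feasible: bool,
    drone_feasible: bool,
    vehicle_travel_time: float,
    threshold: float,
) -> str:
    """Threshold-based assignment that always accepts feasible requests."""
    if not vehicle_feasible and not drone_feasible:
        return "no_service"
    if vehicle_feasible and not drone_feasible:
        return "vehicle"
    if drone_feasible and not vehicle_feasible:
        return "drone"
    # Both fleets are feasible: close customers go to vehicles, far ones to drones.
    return "vehicle" if vehicle_travel_time <= threshold else "drone"
```

The threshold is the only tunable parameter, which makes the policy easy to calibrate but unable to react to the time of day or the current fleet utilization.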

We also consider a policy similar to that allows the rejection of feasible customers. We call this policy . In policy , when a customer is infeasible with regard to the fleet designated by the threshold, the customer is not offered service. For both policies and , we learn in the manner described in drones.

To take advantage of alternative state information, we also consider a PFA-based policy Delta that is parametrized by an insertion-cost threshold for vehicles. As described in Section 5.2, every new customer is associated with a potential insertion cost Vehicle if it is feasible to serve the customer by vehicles. One of the goals of using drones in SDD is to discourage vehicles from traveling to remote areas because such dispatches can be costly. The Delta is designed to avoid these less preferred decisions by controlling the insertion cost to be within a threshold . Given a new customer request, the policy checks feasibility for vehicles and the corresponding insertion-cost Vehicle. If service by a vehicle is feasible and , the customer is assigned to the vehicle. Otherwise, the policy selects delivery by drone if feasible, or no service is offered. Parameter is determined by enumeration.
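The insertion-cost rule and the tuning of its threshold by enumeration can be sketched as follows. The comparison "insertion cost not greater than the threshold" is an assumption about the elided condition, and the function names and the `evaluate` callback (which would run a simulation and return a solution quality) are hypothetical.

```python
from typing import Callable, Iterable, Optional


def delta_decision(
    vehicle_feasible: bool,
    drone_feasible: bool,
    insertion_cost: Optional[float],
    threshold: float,
) -> str:
    """Assign to a vehicle only when the insertion cost is within the threshold."""
    if vehicle_feasible and insertion_cost is not None and insertion_cost <= threshold:
        return "vehicle"
    if drone_feasible:
        return "drone"
    return "no_service"


def tune_threshold(candidates: Iterable[float],
                   evaluate: Callable[[float], float]) -> float:
    """Enumerate candidate thresholds and keep the one with the best evaluation."""
    return max(candidates, key=evaluate)
```

Unlike the travel-time threshold policy, this rule can deny a request that a vehicle could feasibly serve when the detour is too costly and no drone is available.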

7.3 Solution Quality

We use solution quality as the measure of performance of different policies. For each policy , we define its solution quality as the average percentage of served orders:


To compare the performance of different policies, we define the improvement of over as:


For varying fleet sizes, Tables 1 and 2 summarize the solution quality of Q and the benchmarks. For PFA, we present the average number of customers served. We then use PFA as a benchmark and report the other policies’ performance relative to PFA. We also perform a paired-sample t-test with PFA. In the tables, the mark * indicates that the p-value of the paired-sample t-test is less than %.
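The measures used in this section can be computed as in the sketch below. It rests on stated assumptions rather than the paper's exact definitions. Solution quality is taken as the mean over evaluation days of the per-day share of served orders, improvement is taken as the relative difference to the baseline in percent, and the test statistic is the standard paired-sample t statistic on per-day results. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def solution_quality(served, requested):
    """Average percentage of served orders over the evaluation days."""
    return 100.0 * mean(s / r for s, r in zip(served, requested))


def improvement(quality_new, quality_base):
    """Relative improvement (%) of one policy over a baseline policy."""
    return 100.0 * (quality_new - quality_base) / quality_base


def paired_t_statistic(x, y):
    """t statistic of a paired-sample t-test on per-day results.

    Requires at least two days and non-constant differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

The p-value would then follow from a t distribution with one fewer degree of freedom than the number of evaluation days. That step is omitted here to keep the sketch free of third-party dependencies.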

Fleet Size      PFA (avg.            Improvement over PFA (%),
(Veh, Drone)    customers served)    one column per remaining policy
2, 5            227.6                5.8*     18.7*    22.0*
2, 10           312.8                0.5*     8.8*     10.7*
2, 15           391.1                -1.4*    4.4*     6.8*
3, 5            293.4                3.5*     13.5*    16.7*
3, 10           376.2                -0.3     6.5*     9.2*
3, 15           460.3                -3.7*    0.3*     3.0*
4, 5            354.7                1.9*     9.0*     11.0*
4, 10           439.9                -2.3*    2.7*     3.6*
4, 15           499.6                -2.2*    -0.9*    0.0
Table 1: Improvements (%) over PFA (500 expected homogeneously distributed customers)
Fleet Size      PFA (avg.            Improvement over PFA (%),
(Veh, Drone)    customers served)    one column per remaining policy
2, 5            255.2                9.5*     14.4*    21.9*
2, 10           349.9                2.1*     4.9*     10.2*
2, 15           449.4                -3.8*    -2.5*    3.2*
3, 5            336.2                9.0*     10.9*    16.6*
3, 10           441.6                -0.4*    0.7*     5.2*
3, 15           498.0                -1.0*    -1.4*    0.0
4, 5            425.2                2.6*     4.5*     6.5*
4, 10           497.7                -1.2*    -1.2*    -0.1*
4, 15           499.2                0.0      0.0      0.0
Table 2: Improvements (%) over PFA (500 expected heterogeneously distributed customers)

Overall, Q outperforms the benchmarks. The only instance in which it does not is that with heterogeneously distributed customers, 4 vehicles, and 10 drones. In that case, PFA outperforms Q. Yet, in that instance, the difference is small and both PFA and Q serve almost all of the customers. The value of Q is greatest in the cases in which resources are more constrained. For example, with 2 vehicles and 5 drones, PFA can only serve about 46% (51%) of homogeneously (heterogeneously) distributed customers. The policy Q improves the solution quality by more than 20% in both cases. As resources become less constrained, the relative performance of Q diminishes. Simply put, with abundant resources, there is sufficient slack to overcome the poorer decisions of the benchmark policies.

With the same fleet size, all policies serve more customers in the heterogeneous instances than in the homogeneous instances. This difference results from the fact that, on average, customers in the heterogeneous distribution are closer to the depot and therefore easier to serve.

Another interesting observation is the relative performance of policies and . For instances with only a few drones, outperforms . This changes when the number of drones increases. Recalling the analytical considerations in Section 6, utilizes a drone-centric feature of direct travel time to the customer while draws on the corresponding, but vehicle-centric feature of route duration increase. Thus, by shifting the fleet composition from vehicles to drones, the performance advantage shifts from to as well.

7.4 Illustration of the Decision Making of the Various Policies

In this section, we graphically illustrate the differences in the policies of the proposed Q-learning approach and the benchmark policies. To demonstrate these differences, we select an instance (a day) on which different policies are evaluated. Then, for each policy, we plot the acceptance and assignment decisions of each customer throughout the day. We consider an instance that has expected and homogeneously distributed customers and a fleet of vehicles and drones. For this selected instance, policies serve slightly more customers than on average with policy serving 394, serving 396, serving 395, and serving 444 customers.

Figures 6-9 illustrate the served customers and how they are served. The horizontal axis is the time in minutes ranging from to the latest possible order time , and the vertical axis represents the travel time needed by a vehicle to service a given request, where the travel time is based on the customer’s distance from the depot. Each dot in the figures represents a customer order whose coordinates on the plot are determined by when and where the request is made and whose color depends on the decision made by the corresponding policy: service by vehicle, service by drone, or no service.

Figure 6: Time vs. travel time vs. decision under on the selected instance.

Figure 6 presents the decisions for the policy . This policy uses the vehicle travel time threshold to choose between vehicles and drones. The figure shows that, in the interval , both vehicles and drones can feasibly serve customers. However, starting around , vehicle capacity becomes limited because of existing assignments, and vehicles are unable to meet the delivery deadlines of new requests. Thus, the policy starts to assign customers close to the depot to drones. This result follows from the fact that will assign customers that cannot be served by vehicles to the drone fleet if such an assignment is feasible. As a result, the vehicle travel time threshold vanishes over the second half of the day, and both drone and vehicle capacity becomes limited. Eventually, a relatively large number of customers are left unserved.

Intuitively, a good policy serves as many closer customers as possible because the cost of traveling to them is relatively low. Yet, consider customer , who is close to the depot and orders in the middle of the day. This customer does not get served because both drone and vehicle capacity is consumed. This occurs because of customers like , who is close to the depot and yet assigned to a drone. Such an assignment does not make efficient use of the drone because the relatively long setup and charging times outweigh the travel speed advantage for customers close to the depot. This example illustrates the shortcomings of a policy that serves customers whenever service is feasible for any fleet type.

Figure 7: Time vs. travel time vs. decision under on the selected instance.

Figure 7 presents an illustration of the decision making of policy . In this case, the policy has a hard threshold . Because the policy strictly obeys the threshold, we observe an explicit horizontal line throughout the day. In rigidly maintaining the threshold, we also see that the policy does not provide service to a number of customers over the second half of the day. Yet, in determining which customers do not receive service in a more controlled way than the policy , the policy serves more customers than in this selected instance and even about on average for the instance setting.

Figure 8: Time vs. travel time vs. decision under on the selected instance.

Figure 8 presents an illustration of the decision making of . Because decides whether to offer service to a customer based on the insertion cost, no threshold is visible. While the policy performs well relative to the other benchmarks, we can see times when the rule-based decision making leads to less desirable decisions. Consider the assignments near time 100. At this point, the vehicles end up serving customers that are relatively far from the depot. The threshold-based policies would have prevented these farther-away customers from being added to the routes and thus avoided the less efficient use of the vehicles. The result is that a series of relatively closer customers are assigned to drones in the time interval . Then, just after time 200, a number of requests are denied service because both vehicle and drone capacity have been consumed.

Figure 9: Time vs. travel time vs. decision under on the selected instance.

Figure 9 illustrates the decision making of policy . In the selected instance, demonstrates significant improvements over all the benchmarks, ranging from (over ) to (over ). Although does not operate on any kind of threshold, Figure 9 indicates an emergent time-dependent threshold that results from the learned Q-values. At the beginning of the day, when the delivery resources are sufficient, assigns most customers within the threshold (about minutes) to vehicles and distant customers to drones. During about , when vehicle capacity is mostly consumed, shows a slightly diminishing threshold, maintaining the availability of vehicles. Unlike the benchmarks, even when drones can feasibly serve closer customers during (refer to Figure 11), does not assign them to drones. In fact, this holds for the whole day except for customer at the very end of the day. This exception can be explained by Lemma 2. Because the time of the decision point is one of the features, recognizes that this request is made nearly at the end of the shift. Rather than offering no service to the customer, takes the immediate reward and assigns the customer to a drone because the probability of serving at least customers in the future if not serving the current one is relatively low. Remarkably, the DQL approach learns this policy by itself, without any explicit knowledge of the lemmas and propositions.
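The decision step behind this behavior can be sketched as follows. This is a simplified illustration and not the authors' code. It assumes that a trained network has already produced one Q-value estimate per action for the current state, that infeasible assignments are masked out, and that not offering service is always available. The function name `select_action` and the numbers in the usage note are hypothetical.

```python
def select_action(q_values, feasible):
    """Pick the feasible action with the largest estimated Q-value.

    q_values: dict mapping each action to its estimated value, assumed to
              come from a trained network (supplied by the caller here).
    feasible: dict mapping "vehicle" and "drone" to feasibility flags;
              "no service" is treated as always available.
    """
    candidates = [
        action for action in q_values
        if action == "no service" or feasible.get(action, False)
    ]
    return max(candidates, key=lambda action: q_values[action])
```

With estimates of 5.0 for a vehicle, 7.0 for a drone, and 6.0 for no service, the rule picks the drone when it is feasible. If only the vehicle is feasible, the rule declines service, because the estimated value of keeping capacity free exceeds that of the vehicle assignment.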

Figure 10: Time vs. travel time vs. decision under on the selected heterogeneous instance.

While this analysis focuses on the homogeneously distributed customers, we see similar behavior for the heterogeneous distribution. Figure 10 shows the results for the heterogeneous distribution and policy . The corresponding figures for the benchmark policies can be found in the appendix. We observe that the solution structure is similar: customers more distant from the depot are preferentially served by drones. However, when the distribution changes around time 120, this distance changes as well relative to the customer locations.

Figure 11: Customer feasibility under (from top left to bottom right) , ,,

Another way to examine the difference in decision making among the policies is to examine the infeasibility of customer requests. Figure 11 shows which customers are infeasible for vehicles and for drones. The horizontal axis represents the time of decision points. The vertical axis represents the infeasibility of requests by each fleet. Each dot represents an infeasible customer request.

Figure 11 shows no significant difference in the availability of drones under different policies. On the other hand, the policies exhibit very different patterns of feasibility with respect to vehicles. The policy maintains vehicle feasibility for almost the entire day, with very few infeasible requests at the end of the day. This outcome highlights the relatively greater value of the vehicles that results from their greater capacity and ability to insert customers onto routes, avoiding the need to go back and forth to the depot like drones.

The benchmarks largely fail to maintain vehicle feasibility. All the benchmarks show dense sets of requests that cannot be feasibly served by vehicles. Because it prioritizes the role of vehicles over drones, the policy shows the shortest interval of infeasible requests to vehicles. As we have seen, policy also outperforms the other policies. This analysis suggests that drones may not entirely replace vehicles in last-mile delivery and that the two can benefit from working in combination.

8 Conclusions and Future Work

In this paper, we present an SDD problem with drones and vehicles and approximately solve it using reinforcement learning. For this purpose, we identify particular features of the state to include as inputs to the NN and estimate values of state-decision pairs. Computational results demonstrate that the method is capable of making service decisions and assignments that appropriately balance the use of drones and vehicles throughout the day, resulting in an increased expected number of customers served relative to benchmarks.

There are various directions for future research. First, the analytical results show that the value function of the SDDPHF is monotonic in a number of the state elements. One direction for future research is to explore the enforcement of monotonicity in the learning process for the problem. To the best of the authors’ knowledge, the enforcement of monotonicity in NN approximations is relatively unexplored. Second, we could also explore replacing the routing heuristics used to reduce the action space. For example, instead of selecting the insertion with the minimum cost, we could use an NN to select among several feasible insertions. Third, we could consider additional instance parameters, for example, different vehicle travel speeds to reflect peak and off-peak hours. In addition, the problem in this paper considers only whether to accept or reject customer requests for service. An alternative would be to consider the pricing of the deadline as a way of serving more customers.

Finally, our policy learns assignments of customers to fleet types based on fleet and customer features. This strategy may be valuable for a variety of dynamic routing problems with heterogeneous fleets and/or heterogeneous customers.


Appendix A

In the Appendix, we present the proofs for the analytical results, additional feature combinations, an analysis of the value of not offering service, and detailed results for the heterogeneous distribution.

A.1 Analytical Results (Proofs)

In this section, we present the proofs of the analytical results presented in Section 6.1 of the paper.

Proposition 1.

In the SDDPHF, let represent time of the decision point in the state. Then, the expected reward is monotonically decreasing in .


Note, in the SDDPHF and the simplified version, we always assume and are no earlier than . For example, if in the current state, and the vehicle becomes available at , then we set . Consider two states that only differ in time of the decision point. Say and such that . If the dispatcher is in the state , the worst case is that, the dispatcher does not accept any requests that arrive during , and then, starting , follows the same path as it will do from the state . In this worst case, the expected rewards are equal, . On the other hand, the dispatcher expects to receive