1 Introduction
The growing trend of end consumers adopting ecommerce and wanting to receive their goods without leaving the convenience of their homes has substantially increased the demand for delivery services. In a logistics and freight transport context, the “last mile” denotes the final leg of the supply chain where goods are transported and delivered directly to the final recipients. Despite being the shortest phase of the long freight transport chain, it is oftentimes regarded as the most complex and most costly, with last mile deliveries accounting for more than fifty percent of the total supply chain cost. More than ever, it has become pertinent to find more efficient ways to manage last mile delivery systems.
Last mile delivery is complex and unique in its characteristics. One source of complexity is the large number of constantly changing, geographically dispersed delivery locations. It is also common in practice for route planners to suggest routes that are tailor-made to the specific urban environment, instead of recommending those provided by off-the-shelf software. The drivers themselves are known to frequently deviate from the planned optimal routes (Ceikute and Jensen, 2013; Li and Phillips, 2018). Understanding the decision-making behavior of the drivers and the planners is therefore essential to offer routing plans that capture their preferences and that are ‘good’ according to their subjective evaluation. For that reason, it is crucial to optimize over a measure that represents the preferences of the drivers and planners. Given the difficulty of measuring and defining preferences, a common approach in the literature is to use machine learning to learn these preferences from past decisions, for example in recommender systems (Lu et al., 2015) or search engines (Joachims, 2002).
However, our situation is different as the predictions, the output of the method, have to be TSP solutions. While neural network approaches have been proposed that can learn combinatorial sequence-to-sequence models (Vinyals et al., 2015; Kool et al., 2019), these methods are known to require huge amounts of data (more than is available in a last mile setting), and to have difficulty generalizing over arbitrarily sized routings. Furthermore, we aim for a learning approach that uses a TSP solver as a black box, meaning that the output is always predicted by a TSP solver. This makes the approach compatible with whatever side constraints are formulated in the TSP solver, and with both Mixed Integer Programming-based solvers and proprietary solvers of a company. On the learning side, we propose a novel combination of two learning approaches: a fast probability estimation method that estimates the transition probabilities and that can be reformulated such that any TSP solver can be used to find the maximum-likelihood TSP solution; as well as a structured output prediction method that can learn the weights of a linear preference function by gradient descent, calling a black-box TSP solver before every weight update.
A further complication is that in a last mile setting, delivery stops differ from one day to another. In other words, delivery locations are rarely repeated on a daily basis. That makes the preference learning task difficult to perform, due to the sparsity of the data. The goal of this article is to provide a methodology to learn the preferences of the planners and drivers for using one route over another, taking this specific issue into account.
We consider the following setting: there is a set of delivery locations (or stops) denoted by S. At our disposal there is a daily historical set of solutions H = {(S_t, x_t, f_t)}, where each S_t ⊆ S is the set of stops at day t, and x_t indicates the route plan followed on that specific day. Route plans (or solutions) are feasible solutions of the travelling salesman problem (TSP) (Miller et al., 1960). Vector f_t states other relevant features of the data of day t. In particular, we consider an optional special attribute, namely the drivers’ subjective route quality score, which takes values in {low, medium, high}. The majority of the stops appear only a few times in H, making it nonviable to directly learn relations between the stops in S. However, stops are partitioned into zones of the whole territory. This zonification is represented as a partition Z = {z_1, …, z_m} of S. In other words, the zones jointly cover S and z_k ∩ z_l = ∅ for k ≠ l. This situation is depicted in Figure 1. As zones cluster stops together, the historical data will have frequently reoccurring zone-to-zone transitions, creating the possibility to learn zone transition preferences. The goal is to build a model such that, given a new set of stops, the model returns a routing plan that largely follows the zone-level preferences of the planners and drivers while also taking the total route distance into account.
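To make this setting concrete, the following sketch shows one way to represent a day's data and the zonification; all names and values are invented for illustration and do not reflect the actual data format.

```python
# One day of historical data (illustrative, not the actual data schema).
historical_day = {
    "stops": ["s1", "s2", "s3", "s4"],           # S_t: stops to visit on day t
    "route": ["depot", "s2", "s1", "s4", "s3"],  # x_t: route actually driven
    "score": "high",                             # optional route quality label
}

# The zonification assigns every stop to exactly one zone (a partition of S).
zone_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}

def zone_sequence(route, zone_of):
    """Collapse a stop-level route into the zone sequence it induces,
    dropping consecutive repeats (the depot carries no zone here)."""
    seq = []
    for stop in route:
        z = zone_of.get(stop)
        if z is not None and (not seq or seq[-1] != z):
            seq.append(z)
    return seq
```

Collapsing the example route yields the zone sequence ['A', 'B']: even when individual stops rarely repeat across days, the induced zone-level transitions reoccur, which is the signal the learning stages exploit.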
We propose a two-stage TSP approach to solve this problem: one TSP over the zones, and a second one over the stops that takes the sequence of zones into consideration. These two TSPs involve three learning processes: two to learn preferences over zones and one to learn good zone penalties when solving the stop-level TSP.
A diagram of the methodology is depicted in Figure 2. We first assume that preferences can be captured as costs of going from one zone/stop to another zone/stop. This assumption is standard, as it is used in Canoy and Guns (2019); Mandi et al. (2021); Chen et al. (2021). One advantage of this assumption is the possibility to use the exact and/or approximate algorithms available for the TSP. We then estimate zone transition probabilities from historical data, as well as learning the best weighted combination of the zone distance matrix and the zone preference matrix, using structured output prediction (BakIr et al., 2007). With this zone cost matrix, a sequencing of zones is obtained by solving a zone-level TSP. In the next stage, we use the computed zone sequence to define a zone penalty matrix, penalizing stop-level transitions that do not respect the computed zone ordering. The weights of these penalties are again learned using structured output prediction. Together, this forms the stop cost matrix, for which a TSP solver is used to compute the routing over all stops (potentially with time windows and other operational constraints).
Thus, in our approach there are two TSPs to solve: the first one to determine a sequence of zones, and the second one to route over the stops considering the zone sequencing.
The contributions of our work are the following:

We propose a two-stage TSP approach where we learn probabilistic preferences between zones, and use these preferences to define the penalties in the TSP.

We estimate transition probabilities with a Markovian counting method, and augment it with structured output prediction (BakIr et al., 2007) to learn appropriate penalties that trade off distances and learned preferences at the zone level.

We compute a zone ordering but do not enforce it as a hard constraint. Instead, we penalize zone-order violations and learn the penalties over historic training data.

We experimentally show the suitability of learning only at the zone level, as well as the benefits of structured output prediction for learning both at the zone level and the stop level, on a real data set.
2 Literature Review
The last mile delivery (Zeng et al., 2019) problem has been well studied in the literature. Because of its extensive use in ecommerce, it has received increasing attention in recent years. Last mile delivery may be viewed as a travelling salesman problem (TSP) (Miller et al., 1960) or a vehicle routing problem (VRP) (Dantzig and Ramser, 1959), for which there are many existing available solvers.
However, Li and Phillips (2018) state that in most cases, last mile delivery does not go according to the planned schedule. In the context of point-to-point routing, Ceikute and Jensen (2013) highlight that drivers tend to follow frequently used routes, which are not necessarily shorter in terms of distance or travel time. This is primarily because, in deciding their routes, drivers consider factors which are not in the objective function, e.g., traffic congestion and the availability of parking and fuel stations. Drivers have tacit knowledge of this information. Toledo et al. (2013) also point out that drivers tend to rely on tacit knowledge to plan their routes. Explicitly formalizing this knowledge is an insurmountable task.
Several works have studied the problem of learning the routing preferences of the drivers from past routing trajectories. Delling et al. (2015) present a customizable route planning framework that infers the preferences of the drivers from GPS traces by learning individualized cost functions. Letchner et al. (2006) compute the ratios of the individual drivers’ travel time and the theoretical optimal travel time to learn the implicit preferences and biases. Both of the above-mentioned studies focus on point-to-point commutes of individual drivers.
Canoy and Guns (2019) approach the problem in a vehicle routing setting from a different perspective: they introduce a weighted Markov model to learn the preferences from historical routes. This technique avoids the need to explicitly specify the preference constraints and implicit sub-objectives. Their Markovian model computes the transition probabilities between the stops. The model is then used to solve a maximum likelihood routing problem to come up with a routing scheme. We extend this framework to develop a two-stage TSP, where in the first stage we adapt the technique to compute the transition probabilities between zones to find the zone order.
3 Background: the Traveling Salesman Problem (TSP)
Given a set of stops, with (asymmetric) costs c_{ij} associated to each ordered pair of stops (i, j), the TSP consists of determining a minimum cost circuit passing through each vertex once and only once (Miller et al., 1960; Laporte, 1992). An integer programming formulation to solve the TSP is:
\min \sum_{i \neq j} c_{ij} x_{ij}  (1)
subject to  \sum_{j : j \neq i} x_{ij} = 1 \quad \forall i  (2)
\sum_{i : i \neq j} x_{ij} = 1 \quad \forall j  (3)
u_i - u_j + n\, x_{ij} \leq n - 1 \quad \forall i, j \in \{2, \dots, n\},\ i \neq j  (4)
x_{ij} \in \{0, 1\} \quad \forall i, j  (5)
where x_{ij} = 1 if the route travels directly from stop i to stop j, and 0 otherwise, and the u_i are auxiliary ordering variables. The objective function minimizes the sum of the costs over the arcs in the route. The first two constraints ensure that each stop is visited only once. The third constraint ensures that there are no subtours in the solution.
Our approach does not rely on an integer programming formulation, but works with any TSP solver. By abuse of notation, we will refer to TSP(c) as an anytime oracle that returns an optimal solution of (TSP) for a given cost vector c, or the best solution found in case a time limit is reached.
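To illustrate this oracle view, here is a toy stand-in for TSP(c) by brute-force enumeration. It is for exposition only; any real solver (e.g. OR-Tools or a proprietary one) can be swapped in without changing the rest of the methodology.

```python
from itertools import permutations

def tsp_oracle(cost):
    """Toy black-box TSP oracle: return a minimum-cost circuit that starts
    and ends at stop 0, by brute-force enumeration (exponential; fine only
    for tiny illustrative instances)."""
    n = len(cost)
    best_tour, best_cost = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        c = sum(cost[tour[k]][tour[k + 1]] for k in range(n))
        if c < best_cost:
            best_tour, best_cost = list(tour), c
    return best_tour, best_cost
```

The rest of the pipeline only ever calls such an oracle with a cost matrix and reads back a tour, which is what makes it solver-agnostic.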
Several exact methods and heuristics have been developed to solve this problem since the 1960s, making it one of the most studied problems in operations and transportation research (Applegate et al., 2011).
4 Methodology
In this section we describe in detail our methodology to learn and implement the two-stage TSP. Figure 2 shows the overview. The first learning stage, zone ordering, consists in capturing preferences over transitions between zones. These preferences are modelled as a linear combination of two terms: the first related to the distance between zones, and the second a function of the frequency with which each transition appears in the training data. We then apply a structured output prediction approach to learn the weights in the linear combination. The second stage, stop ordering, consists in learning how to route over stops taking into account a predefined zone ordering. This is done by adding a penalty each time a solution does not respect the zone ordering. The values of these penalties are also learnt by structured output prediction.
4.1 Stage 1. Zone ordering.
The first phase of our proposed two-stage approach is the zone ordering phase. Given the set of stops (delivery locations), with each stop assigned to a zone, our goal at this stage is to generate a good zone order. Note that zones need not be supplied by the user per se; one could also use a clustering method or geographical regions. The main point is to make the regions of interest (stops/zones) less sparse in the data, so that learning a function that can be applied in new, unseen situations becomes more meaningful.
4.1.1 Zone order by distance
A direct approach to obtain a sequence of zones is to use the distance between them, minimizing the total distance of visiting all of them. Given a set of stops and the corresponding set of zones that contain the stops, we solve a TSP to visit each zone. In this case, we use as cost in Equation (1) the Euclidean distance between the centroids of zones z and z′, denoted by d(z, z′). Formally, if the stops inside zone z are geolocated with coordinates (lat_i, lon_i), then the centroid is given by:
c_z = \frac{1}{|z|} \sum_{i \in z} (lat_i, lon_i)  (6)
As the coordinates of the stops within a zone are relatively close to one another, we treat the earth as being locally flat and determine the centroid by taking the average of the latitudes and the longitudes of all the stops. Then, the distances are given by
d(z, z') = \lVert c_z - c_{z'} \rVert_2  (7)
Note that this zone sequencing by distance approach does not make use of the historical data. There is no learning involved, hence the method is able to provide a solution for any given instance without any model training. While this traditional approach indeed gives the distance-wise optimal zone sequence, there is no guarantee that the resulting order is close to that of the expert-made solution.
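A small sketch of the centroid and distance computations of Equations (6) and (7), with invented coordinates:

```python
import numpy as np

# Hypothetical stop coordinates (lat, lon) grouped per zone; values invented.
zones = {
    "A": [(51.05, 3.72), (51.06, 3.71)],
    "B": [(51.10, 3.80), (51.11, 3.81), (51.12, 3.80)],
}

# Eq. (6): the centroid of a zone is the mean of its stop coordinates,
# treating the earth as locally flat.
centroids = {z: np.mean(np.array(coords), axis=0) for z, coords in zones.items()}

def zone_distance(z1, z2):
    """Eq. (7): Euclidean distance between two zone centroids."""
    return float(np.linalg.norm(centroids[z1] - centroids[z2]))
```

Feeding the resulting d(z, z′) matrix to the TSP oracle yields the distance-only zone sequence described above.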
4.1.2 Zone order using transition probabilities
Learning zone order from historical data
We adapt the approach introduced in Canoy and Guns (2019) of learning transition probabilities between zones from the historical data. This approach models preferences as a Markov chain, where the preferences are learnt as the probability of going from one zone/stop to another. The idea is that by estimating the zone transition probabilities, we are implicitly learning the latent preferences of the expert.
Let P denote the transition probability matrix, where each component p_{ij} represents the probability of going from zone i to zone j. While the probabilistic model is relatively simple, it allows us to construct a zone transition probability matrix. Solving (TSP) using this transition matrix gives us the maximum likelihood zone sequence. This maximum likelihood problem can be cast as:
x^* = \arg\max_{x \in \Omega} \prod_{i,j} (p_{ij})^{x_{ij}}  (8)
which is equivalent to solving a TSP with cost (9) in (1), since maximizing a product of probabilities is equivalent to minimizing the sum of their negative logarithms. So the model is compatible with any TSP solver. We now focus on how the transition probability matrix is obtained.
c_{ij} = -\log p_{ij}  (9)
Transition probability matrix construction algorithm
Algorithm 1 outlines the process of constructing the zone transition probability matrix from historical data. Let H denote the set of historical routes and Z the zonification.
We build the transition probability matrix by computing a weighted frequency of appearance of each transition between zones. The algorithm works as follows: for each route in H, an adjacency matrix is built whose entry for (i, j) equals 1 if the transition from zone i to zone j occurs in the route, and 0 otherwise. Then, a frequency matrix F is built by summing all the adjacency matrices, weighted by instance-specific parameters α. Finally, the transition probability matrix P is obtained by row-normalizing F.
When a route quality score, e.g., high/medium/low, is associated to each route, we can control how much the zone sequence contained in each route influences the resulting zone transition probability matrix (step 4 of Algorithm 1). This can be achieved by allocating a larger weight to high-quality routes (and a smaller weight to low-quality routes). For example, we can assign numerical weights α_high > α_medium > α_low corresponding to the three score labels. Alternatively, we can train selectively with only the high-quality routes by setting α_high = 1 and α_medium = α_low = 0.
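A numpy sketch of Algorithm 1 on toy zone sequences. The per-label weights alpha and the example routes are illustrative assumptions; the final line shows the negative-log transform of Equation (9) that turns the probabilities into TSP costs.

```python
import numpy as np

zone_ids = {"A": 0, "B": 1, "C": 2}
routes = [                      # (zone sequence, quality label); invented
    (["A", "B", "C"], "high"),
    (["A", "C", "B"], "medium"),
    (["A", "B", "C"], "high"),
]
alpha = {"high": 1.0, "medium": 0.5, "low": 0.1}  # assumed quality weights

# Sum the 0/1 adjacency matrix of each route, weighted by its alpha.
m = len(zone_ids)
F = np.zeros((m, m))
for seq, score in routes:
    for a, b in zip(seq, seq[1:]):
        F[zone_ids[a], zone_ids[b]] += alpha[score]

# Row-normalize the weighted frequencies into transition probabilities.
row_sums = F.sum(axis=1, keepdims=True)
P = np.divide(F, row_sums, out=np.zeros_like(F), where=row_sums > 0)

# Maximum-likelihood routing: negative-log probabilities act as TSP costs,
# with a floor to keep never-observed transitions finite but expensive.
C = -np.log(np.maximum(P, 1e-6))
```

Setting α_medium = α_low = 0 here reproduces the "train only on high-quality routes" variant mentioned above.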
4.1.3 Zone order by mixing distances and transition probabilities
We take inspiration from Canoy and Guns (2019), where experimental results show that a cost matrix that mixes (1) distances and (2) transition probabilities can lead to better solutions being found. From an application point of view, this is motivated by the observation that planners have personal and experience-based preferences, but also always take the total distance into account. From a probabilistic inference perspective, it could be that the estimated probability matrix is uncertain (or indifferent) with respect to certain stops/zones, for example, new stops that have rarely (or even never) been visited before. This too suggests it can be beneficial to combine distances with the estimated probabilities.
The formulation of the cost matrix for the TSP is then:
c_{ij} = w_1 \left(-\log p^{d}_{ij}\right) + w_2 \left(-\log p_{ij}\right)  (10)
where p_{ij} are the entries of the estimated preference matrix P and p^{d}_{ij} those of a distance-based probability matrix P^{d}. In our distance-based probability, we want zones that are closer together (so with small d(i, j)) to have a higher probability than zones that are far apart. We hence first invert the distance and then normalize the rows so they sum up to 1:
p^{d}_{ij} = \frac{1 / d(i, j)}{\sum_{k \neq i} 1 / d(i, k)}  (11)
Both probabilities lie in the interval [0, 1], but it is nonetheless impossible to know upfront how best to mix them, that is, what values of w = (w_1, w_2) will result in a cost matrix leading to the best zone routing.
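The distance-based probability matrix and the mixed cost matrix can be sketched as follows. The preference matrix entries and the equal starting weights are illustrative assumptions, and costs are taken as negative log-probabilities so that both components live on the same scale.

```python
import numpy as np

# Toy zone-to-zone distance matrix (invented values).
D = np.array([[0.0, 2.0, 8.0],
              [2.0, 0.0, 4.0],
              [8.0, 4.0, 0.0]])

# Invert the off-diagonal distances, then row-normalize so each row sums
# to 1; closer zones get higher probability, the diagonal gets 0.
inv = np.zeros_like(D)
off = ~np.eye(len(D), dtype=bool)
inv[off] = 1.0 / D[off]
P_dist = inv / inv.sum(axis=1, keepdims=True)

# An assumed learned preference (transition probability) matrix.
P_pref = np.array([[0.0, 0.8, 0.2],
                   [0.1, 0.0, 0.9],
                   [0.3, 0.7, 0.0]])

# Mix the two components as costs with weights (w1, w2); equal to start.
w1, w2 = 1.0, 1.0
eps = 1e-6
C = w1 * -np.log(np.maximum(P_dist, eps)) + w2 * -np.log(np.maximum(P_pref, eps))
```

The question addressed next is precisely how to set (w1, w2) from data rather than fixing them by hand.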
We propose a principled way to learn the values of w from historic data. The key challenge here is that we do not know what the true cost matrix is. If that were the case, we could have treated the weight learning as a standard machine learning regression problem.
Instead, we only know, for each historic routing, what the followed zone tour was; and that a given tuple (w_1, w_2) leads to a cost matrix for which we can compute the optimal zone tour using a TSP solver. In other words, the actual values of the predicted costs do not matter, only the resulting tour x as the optimal solution of TSP(c). We observe that this can be seen as a structured output prediction problem.
Structured output prediction is used in settings where, given an input, we want to predict an output that satisfies a predefined structure. For instance, in part-of-speech (PoS) tagging, the input is a sentence and the output is the sequence of PoS tags, which must have the combinatorial and compositional structure of the English language. In this case, treating the PoS tag of each word independently would be ineffective, as such an approach would fail to capture the expressivity of the language. The structured output prediction approach, on the other hand, predicts the PoS tags of multiple words simultaneously and utilizes the dependencies between them to maintain the linguistic structure in the output (BakIr et al., 2007). Structured output prediction has been successfully used in complex applications such as semantic parsing in natural language processing (Clarke et al., 2010), semantic role labeling (Yadollahpour et al., 2013) and computational biology (Chen et al., 2020). In our case, the predefined structure is that the prediction has to be a circuit that visits every stop exactly once: a feasible TSP solution. In general, in structured output prediction, the problem of searching for a score-maximizing output structure (called the inference task) is the following:
x^* = \arg\max_{x \in \Omega(z)} s(z, x; w)  (12)
where z is the input (in our case: the set of zones to route over, as well as the pairwise distances and any other relevant information), Ω(z) is the (implicit) space of all possible outputs that satisfy the structural constraints (e.g. the feasible solutions of the TSP defined implicitly by (2)–(5)) and w are the model parameters that we wish to learn. Without loss of generality we will assume that the score s needs to be maximized, to stay close to the traditional structured output prediction literature (in contrast to that literature, we write x instead of y to denote the output structure, as x is more natural for the TSP variables in this paper), even though the TSP is formulated to minimize its cost matrix (or, equivalently, maximize its negated cost matrix).
Many approaches to structured output prediction use approximate inference techniques, for example using the Viterbi algorithm to decode a sequence (BakIr et al., 2007). However, TSPs have a peculiar structure and very efficient specialised solvers that are often anytime, meaning they can be interrupted with a timeout and will return the best solution found so far. We hence do not mind calling a TSP solver repeatedly during learning.
We will use the seminal structured perceptron algorithm of Collins (2002). It is a variant of the perceptron algorithm, which in turn forms the basis of neural network learning. While it is restricted to learning a linear function over its input, this is sufficient for our needs. It furthermore has nice convergence properties in the case of separable data, similar to the perceptron algorithm (Collins, 2002). We assume given a dataset of training examples {(z_t, x_t)}, with z_t corresponding to an input structure and x_t the intended output structure, where x_t ∈ Ω(z_t).
The structured perceptron algorithm requires defining a problem-specific representation function φ that maps every valid input/output pair (z, x) to a feature vector φ(z, x). The length of the feature vector determines the length of the weight vector w, in such a way that the linear function w^T φ(z, x), where w^T φ denotes the inner product, represents the quality of the input/output pair. In other words, the structured perceptron algorithm assumes that the score is linear in the features, such that the inference task becomes:
x^* = \arg\max_{x \in \Omega(z)} w^{T} \varphi(z, x)  (13)
The learning task is to set the weight values using the training examples as evidence.
Before looking at the structured perceptron algorithm, we will translate its required input to our TSP setting:

x is a TSP solution, represented by an adjacency matrix [x_{ij}];

z is the input structure, namely the tuple (n, D, P), where n is the number of zones in the instance, D the normalized distance matrix between the zones, and P the estimated probability matrix between the zones;

w is the length-2 weight vector that trades off the importance of the distance and the probability in determining the TSP solution quality;

φ hence needs to compute a length-2 feature vector such that w^T φ(z, x) corresponds to the total tour cost under Equation (10). We define
\varphi_1(z, x) = \sum_{i,j} \left(-\log p^{d}_{ij}\right) x_{ij}  (14)
\varphi_2(z, x) = \sum_{i,j} \left(-\log p_{ij}\right) x_{ij}  (15)
We can now show that
w^{T} \varphi(z, x) = w_1 \varphi_1(z, x) + w_2 \varphi_2(z, x)  (16)
= w_1 \sum_{i,j} \left(-\log p^{d}_{ij}\right) x_{ij} + w_2 \sum_{i,j} \left(-\log p_{ij}\right) x_{ij}  (17)
= \sum_{i,j} \left( w_1 \left(-\log p^{d}_{ij}\right) + w_2 \left(-\log p_{ij}\right) \right) x_{ij}  (18)
= \sum_{i,j} c_{ij} x_{ij}  (19)
The structured perceptron algorithm learns given a dataset and a representation function . The pseudocode is shown in Algorithm 2.
The weight vector w is initialized on line 1, e.g. to constant values such as (1, 1). In every epoch, the algorithm then iterates over the training examples (line 3). It first computes the predicted TSP solution of the given instance for the current weight vector w. If the predicted output does not match the intended TSP solution, then a small perceptron update of the weights is performed, using the difference between the representation function evaluated at the true structured output and at the predicted output. The learning rate η is a small value that controls how large the weight updates are in each iteration. This procedure is repeated for a given number of epochs. In the experiments, we will evaluate to what extent the use of structured output prediction on the weights can improve the zone ordering sequence found.
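The loop just described can be sketched as follows. The function signatures are illustrative: `solve` stands in for the black-box inference oracle (in our setting, a TSP solver run on the cost matrix induced by w), and outputs are assumed to be directly comparable.

```python
import numpy as np

def structured_perceptron(data, phi, solve, n_feats, eta=0.1, epochs=1):
    """Sketch of the structured perceptron (Algorithm 2, after Collins 2002).
    data:    list of (z, x_true) pairs: instance and intended output
    phi:     representation function phi(z, x) -> feature vector
    solve:   inference oracle returning the best output for weights w
    n_feats: length of the weight/feature vectors
    """
    w = np.ones(n_feats)              # initialize, e.g. to uniform weights
    for _ in range(epochs):
        for z, x_true in data:        # iterate over the training examples
            x_pred = solve(z, w)      # predict with the current weights
            if x_pred != x_true:      # perceptron update on a mismatch
                w = w + eta * (phi(z, x_true) - phi(z, x_pred))
    return w
```

On a toy problem with a handful of candidate outputs, a couple of epochs already suffice to make the oracle return the intended output for the learned weights.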
4.2 Stage 2. Stop ordering.
Up to now we discussed how in stage 1 we can estimate the transition probabilities between zones from historical data, and how we can learn the trade-off between distance and learned probability to obtain good zone-level TSP tours. The goal in this second stage is to find a TSP tour over all the stops, taking the predicted zone ordering of the previous stage into account.
4.2.1 Stop order by minimizing travel time
A direct approach to finding a candidate stop order for a given instance, i.e., a set of stops, is by solving (TSP) with the stops as the vertices and the given distance or travel time matrix as the cost matrix. Again, only the pairwise travel times are taken into account; there is no clustering, as the zones associated with the stops are not considered, nor is there any learning from historical data. Hence, while this method results in a stop sequence that is optimal in terms of total travel time, there is no assurance that its results are close to the ones made by the expert.
4.2.2 Stop order with zone penalties
At this stage, we propose to add to the travel times between stops a zone penalty, which we decompose into different penalties for different degrees of order violation with respect to the zone ordering of the previous stage.
Let us suppose the following:

ZO, the zone order obtained in stage 1: ZO(z) = k if zone z is visited in the k-th position;

S = {s_0, s_1, …, s_n}, the set of stops that need to be routed, with s_0 denoting the depot;

T, a travel time matrix whose (i, j)-th element t_{ij} represents the time to travel from stop s_i to stop s_j.
To find our desired stop order, we minimize the total travel time while giving some penalty for each transition from one zone to the next. The weight of these penalties will depend on the zone order ZO obtained in stage 1.
We propose to formulate the cost matrix for the second-stage TSP as a weighted combination of the distance between stops s_i and s_j and of how the zones of the two stops relate to the computed zone ordering.
The zone ordering also induces an order on the stops. Let us define the order index function o(i), which is equal to k if s_i ∈ z and ZO(z) = k. Thus, o(i) indicates the position at which stop s_i has to be visited. We can now check that two stops belong to the same zone by checking that their zones coincide, which is equivalent to checking o(i) = o(j). Furthermore, we can check whether the zone of stop s_j is the ‘next’ zone in the zone order, compared to the zone of s_i, in which case o(j) = o(i) + 1 is true. Similarly, we can compute whether s_j is in the ‘previous’ zone according to o(j) = o(i) − 1, or two zones ahead, etc.
Using the zone order index and the logical expressions just introduced, we can now define a penalized cost matrix for the stoplevel TSP as follows:
c_{ij} = w_1 t_{ij} + w_2 [o(j) = o(i) + 1] + w_3 [o(j) = o(i) + 2] + w_4 [o(j) \geq o(i) + 3] + w_5 [o(j) = o(i) - 1] + w_6 [o(j) = o(i) - 2] + w_7 [o(j) \leq o(i) - 3]  (20)
where [·] is the Iverson bracket: it returns 1 if the logical statement in the bracket is true, and 0 otherwise. Note how for every pair of stops s_i and s_j, at most one of the Iverson brackets will return the value 1; the others will be 0. Hence, this function allows putting different penalties on different kinds of zone order violations.
One approach to setting the penalty weights is to choose them manually. For example, if we want to produce routing plans that strictly follow the zone sequencing while minimizing total travel time, we could set the weight of the ‘next’ transition to 0 and all other penalty weights to ∞ (or a very large value).
Alternatively, the zone order penalties could decrease stepwise for larger gaps between the zones of subsequent stops. This could allow “jumps” between adjacent zones, as observed in a number of the actual expert solutions (see also the data analysis in the experiments section). However, determining the right penalty values manually is tedious, prone to error and bias, and even more difficult to keep up to date over time.
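A sketch of assembling such a penalized cost matrix. The grouping of the seven weights as travel time plus six zone-gap categories (mirroring Table 1, with same-zone transitions incurring only the travel-time term) is our reading of the text; all values are illustrative.

```python
import numpy as np

def penalized_cost(T, order, w):
    """Stop-level cost matrix: travel time plus a zone-order penalty.
    T:     n x n travel time matrix
    order: order[i] = stage-1 position of the zone of stop i
    w:     7 weights (travel time, next, next2, next3+, prev, prev2, prev3+)
    """
    n = len(T)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            gap = order[j] - order[i]
            C[i, j] = w[0] * T[i][j]
            if gap == 1:    C[i, j] += w[1]   # next zone in the ordering
            elif gap == 2:  C[i, j] += w[2]   # skips one zone ahead
            elif gap >= 3:  C[i, j] += w[3]   # three or more zones ahead
            elif gap == -1: C[i, j] += w[4]   # previous zone
            elif gap == -2: C[i, j] += w[5]   # two zones back
            elif gap <= -3: C[i, j] += w[6]   # three or more zones back
    return C
```

Feeding the resulting matrix to any TSP solver yields a stop sequence that trades off travel time against zone-order violations.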
4.2.3 Learning penalty parameters via SOP
Given the weighted penalty function, just like in the zone-order TSP case, we can learn these penalty weights automatically on a training set. For this, we can again use structured output prediction; more specifically, the structured perceptron algorithm described in the previous section.
We now provide a mapping to the required input of the perceptron algorithm:

x is again a TSP solution, represented by an adjacency matrix (in this case, over the stops instead of over the zones);

z is the input structure, for which in this case we propose the tuple (n, T, o), where n is the number of stops in the instance, T the travel time matrix between the stops, and o the zone order index of the stops;

w is the length-7 weight vector that balances the importance of the distance and the different penalties;

φ is the function computing the length-7 feature vector consisting of the total travel time as well as the total number of violations of every type; we define
\varphi(z, x) = \Big( \sum_{i,j} t_{ij} x_{ij},\ \sum_{i,j} [o(j) = o(i)+1]\, x_{ij},\ \sum_{i,j} [o(j) = o(i)+2]\, x_{ij},\ \sum_{i,j} [o(j) \geq o(i)+3]\, x_{ij},\ \sum_{i,j} [o(j) = o(i)-1]\, x_{ij},\ \sum_{i,j} [o(j) = o(i)-2]\, x_{ij},\ \sum_{i,j} [o(j) \leq o(i)-3]\, x_{ij} \Big)  (21)
From this, as in the previous section, we can similarly derive that w^T φ(z, x) equals the total penalized cost of the tour. We want to stress that although Equation (20) contains logical expressions, every o(i) is a constant for a given zone ordering, hence every Iverson bracket value is also a constant. While the equation looks like a multi-objective formulation, and indeed it is one, each of the sub-objectives is over the same x_{ij} variables, and hence this ‘multi’ objective formulation can be equivalently rewritten as a single linear function over the variables.
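The corresponding length-7 feature extraction can be sketched as follows; the gap categories mirror Table 1 and are our reading of the length-7 vector described above, with tours given as lists of stop indices.

```python
import numpy as np

def phi_stop_level(T, order, tour):
    """Length-7 feature vector of a stop-level tour: total travel time plus
    the count of transitions in each zone-gap category.
    tour: list of stop indices, e.g. [0, 3, 1, 2, 0]."""
    feats = np.zeros(7)
    for i, j in zip(tour, tour[1:]):
        feats[0] += T[i][j]
        gap = order[j] - order[i]
        if gap == 1:    feats[1] += 1
        elif gap == 2:  feats[2] += 1
        elif gap >= 3:  feats[3] += 1
        elif gap == -1: feats[4] += 1
        elif gap == -2: feats[5] += 1
        elif gap <= -3: feats[6] += 1
    return feats
```

With this representation, the inner product of the weight vector with the features equals the total penalized cost of the tour, as the structured perceptron requires.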
5 Numerical Results
We now turn to evaluating the performance and benefits of our proposed approach.
Data description
For an empirical evaluation of the proposed methodology, we conduct our experiments on data provided by Amazon for the Last Mile Routing Challenge (https://routingchallenge.mit.edu/). The provided data consists of 6112 historical routes. Each route is a feasible TSP solution which starts and ends at a station, and visits from 31 to 238 stops. As displayed in Fig. 3(a), the distribution of routes between the 17 stations is not uniform. Fig. 3(b) shows how stops are shared between stations. All stops are clustered into 8868 zones. Planning experts assigned a quality score label (high, medium, low) to each route. Vehicle capacities, times of departure and time windows of stops were not considered, although they can be added to the TSP solver without changes to the methodology.
We can extract a zone ordering from each route and empirically observe transitions between (non-)adjacent zones. Table 1 reports the frequency of such transitions. next and prev respectively stand for moving one zone forward or backward according to the zone ordering. next2 and prev2 denote moving two steps in the respective direction. next3+ and prev3+ account for three or more steps, which is a more blatant zone order violation. As displayed in Table 1, those are more common in lower-quality routes: medium-quality routes on average have 0.073% and 0.178% of those transitions, against only 0.071% and 0.168% for high-quality routes. The small number of low-quality routes available does not allow for a fair comparison. Nevertheless, this observation substantiates the underlying hypothesis that more zone order violations in a route lead to worse route quality, and that different penalties for different order violations are sensible.
Scoring
The Amazon challenge scoring function computes the similarity between the actual (zone or stop) sequence and an algorithm-produced sequence. The computation combines Sequence Deviation and Edit Distance with Real Penalty:

score(A, B) = SD(A, B) · ERP_norm(A, B) / ERP_e(A, B)

where A is the actual sequence and B is the user-submitted sequence. SD(A, B) denotes the Sequence Deviation of B with respect to A, ERP_norm(A, B) denotes the Edit Distance with Real Penalty applied to sequences A and B with normalized distances or travel times, and ERP_e(A, B) denotes the number of edits prescribed by the ERP algorithm on sequence B with respect to A. Lower scores indicate greater similarity to the actual sequence.
System description
All experiments were run on a Lenovo ThinkPad X1 Carbon with an Intel Core i7 processor running at 1.8GHz with 16GB RAM. As TSP solver, we used OR-Tools 9.0 with guided local search as the search strategy. During SOP training, and to compute the final zone and stop TSPs that are scored, we set a timeout of 30 seconds for each TSP solver call. The timeout is reached for each TSP computation. Probability estimation and the structured perceptron algorithm were implemented in numpy. We used a fixed learning rate determined after small-scale experiments on the training data. We train for 1 epoch unless mentioned otherwise. We will make our implementation available under an open source license upon acceptance of the paper.
route quality label  same  next  next2  next3+  prev  prev2  prev3+
high  80.355%  13.552%  0.115%  0.071%  0.209%  0.050%  0.168%
medium  89.295%  13.642%  0.107%  0.073%  0.190%  0.048%  0.178%
low  87.610%  13.668%  0.120%  0.053%  0.174%  0.033%  0.140%
Training and testing set split
Note that we do not have access to the hidden dataset used in the challenge evaluation. We hence conduct our experiments on the historical data which we split into a train and a test set. As each of the 6112 historical instances is tagged with a route quality label, we employ stratified sampling in order to sample each route label subpopulation (high, medium, low) independently.
Route score  Total  Train  Test 

high  2718  2174  544 
medium  3292  2634  658 
low  102  82  20 
Total  6112  4890  1222 
Routes used for training  Distance  Markov  Markov+Distance  SOP  (average zone_score)
high  0.0642  0.0422  0.0314  0.0317 
high + medium  0.0642  0.0371  0.0293  0.0291 
high + medium + low  0.0642  0.0374  0.0291  0.0289 
# of epochs  distance weight  preference weight  zone_score
0            1                1                  0.0291
1            1.58             2.33               0.0289
2            2.16             3.66               0.0297
3            2.75             4.99               0.0296
4            3.33             6.32               0.0295
5            3.91             7.66               0.0295
Stage 1 experiments: zone-level ordering quality
In the first set of experiments, we test and compare the performance of the different Stage 1 zone sequencing approaches presented in the previous section: the distance-based approach (Distance), the transition-probability/preference-based approach (Markov), the equal-weight combination of the two (Markov+Distance), and the combined approach in which the component weights are determined by structured prediction (SOP).
A graphical visualization of the distribution of the zone scores, as well as a summary of the average zone scores for the four approaches, is shown in Figure 3. Visually, we can already notice a significant improvement when using the Markov model instead of the purely distance-based approach. Furthermore, when the distance and preference components are mixed with equal weight, we obtain further improved zone scores. Using Structured Output Prediction (SOP) to find better weights marginally improves the result further. A histogram showing the frequency distribution of the zone scores of the four approaches is shown in Figure 6.
In the table at the bottom of the figure we show the average zone score when using only the high-quality labeled routes, or also the medium- and low-quality ones. Although they differ in quality, we can see that using all data results in better overall zone orders. We believe this is simply due to the larger amount of data available, as can be seen from Table 2.
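As a sketch of the preference-learning component: zone transition probabilities can be estimated from historical zone sequences and mixed with distances into a single cost matrix. The function names, Laplace smoothing, and the negative-log transform below are our own illustrative choices, not necessarily the paper's exact formulation:

```python
import numpy as np

def estimate_transitions(zone_sequences, zones):
    """Estimate Markov transition probabilities between zones from
    historical zone sequences, with Laplace smoothing."""
    idx = {z: i for i, z in enumerate(zones)}
    counts = np.ones((len(zones), len(zones)))  # Laplace smoothing
    for seq in zone_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def combined_cost(dist, probs, w_dist=1.0, w_pref=1.0):
    """Mix distance and preference costs; -log turns transition
    probabilities into additive penalties (high probability = low cost)."""
    return w_dist * dist + w_pref * -np.log(probs)
```

Solving a TSP over this combined matrix yields zone sequences that trade off travel distance against learned preferences, with the trade-off governed by the two weights.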
We take a closer look at the performance of the SOP approach in Figure 4. For this experiment, we ran the structured perceptron algorithm for multiple epochs. In the figure, we can see that, starting from uniform weights, the zone score on the test data improves slightly before worsening again. Looking at the (interpretable) weights, we see that learning especially increases the weight of the preference matrix relative to the weight of the distance matrix. Runtimes are not shown, but every instance reaches the timeout of 30 seconds during SOP training.
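The per-epoch weight update behind these numbers can be sketched as a structured perceptron step. Here `feat_true` and `feat_pred` would hold the per-component costs (e.g., total distance and total preference cost) of the expert's sequence and the solver's sequence; the name and the sign convention for cost minimization are our own:

```python
def perceptron_step(w, feat_true, feat_pred, lr=0.1):
    """One structured perceptron update for a cost-minimizing solver:
    when the predicted sequence differs from the expert's, raise the
    weight of components on which the prediction is cheaper than the
    truth, so the solver is steered toward the expert's choice."""
    return [wi + lr * (fp - ft) for wi, ft, fp in zip(w, feat_true, feat_pred)]
```

Each training step thus requires one full TSP solve to obtain `feat_pred`, which is why every instance runs to the 30-second solver timeout.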
Penalty weights                          Stop score by zone sequencing approach
                                         Trav_time-based  Dist    Markov  SOP
2, 1, 1.5, 2, 1.5, 2, 2                  0.0929           0.0617  0.0510  0.0501
2, 1, 2, 4, 2, 4, 6                      0.0929           0.0618  0.0507  0.0498
2, 0.35, 2.13, 4.09, 2.32, 4.02, 6.10    0.0929           0.0625  0.0479  0.0465
Stage 2: stop-level experiments
The different Stage 2 stop ordering approaches are evaluated and compared in this next set of experiments. The first (Trav_time-based) is the conventional approach of solving the TSP with the given travel time matrix as the cost matrix. The other three are zone penalty approaches, where the zone penalty is determined by the zone order generated in Stage 1.
Numerical results are summarized in Figure 5. The figure shows that the non-zone-based approach using distances performs worst. Using only distances, but in a two-stage approach (e.g., zone penalties from a distance-based zone ordering), already leads to improved results. This is in line with our data analysis showing that a zone ordering is typically followed. Our next two approaches, Markov_zone_seq and SOP_zone_seq, differ only in how the zone ordering was computed. As their zone orderings performed rather similarly, their stop-level scores are also rather similar.
We now look at the effect of different zone penalty weights, shown at the bottom of Figure 5. We tried two handcrafted penalty settings: one in which the penalty for 'larger' zone violations increases gradually (1, 1.5, 2) and one in which it increases more rapidly (1, 2, 4, 6). Finally, the bottom row, which was initialized with the better-performing second-row values, shows the output after 1 epoch of SOP learning on these weights. We can see that the best approach is the combination of Markov estimation with SOP at both the zone level and the stop level. Looking at the bottom entry in particular, the stop-level SOP especially learned to decrease the same-zone penalty weight while increasing the 'previous' and 'next' penalty weights. The histogram in Figure 6 shows the frequency distribution of the stop scores of the four approaches, where most scores of the better-performing methods, SOP and Markov, tend to be closer to zero.
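The zone-penalty objective can be sketched as follows. The helper, its offset categories, and the example penalty values are our own illustrative reading of the framework (in the paper the weights are handcrafted or learned by SOP):

```python
def penalized_costs(travel_time, stop_zone, zone_order, penalty):
    """Add a zone-transition penalty to each stop-to-stop travel time,
    based on how far apart the two stops' zones lie in the Stage 1
    zone order (offset 0 = same zone, +1 = next zone, -1 = previous
    zone, anything further away uses the 'far' penalty)."""
    pos = {z: i for i, z in enumerate(zone_order)}
    n = len(stop_zone)
    cost = [row[:] for row in travel_time]  # copy, do not mutate input
    for i in range(n):
        for j in range(n):
            offset = pos[stop_zone[j]] - pos[stop_zone[i]]
            cost[i][j] += penalty.get(offset, penalty['far'])
    return cost
```

The stop-level TSP is then solved over this penalized matrix, so transitions that violate the preferred zone order are discouraged but not forbidden.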
6 Concluding Remarks
This article studies the problem of learning the preferences of drivers and planners over a set of delivery locations in the last mile delivery context. The main difficulty of this problem is that clients do not appear often in the historical data, making it impossible to learn transition preferences at the stop level. Hence, we study a setting where stops are clustered into zones. We propose a twostage TSP approach to solve this problem.
In the first stage, the preferences of transitions between zones are learned from historical data. This allows us to produce sequences of zones aligned with the preferences of the expert users. We adapt a Markovian approach from the literature to learn zone transition probabilities and combine these with pairwise zone distances to generate zone sequences that not only minimize the travel distance but also maximize user preferences. Furthermore, we apply structured output prediction to learn the optimal parameters for combining the distances and the transition probabilities.
In the second stage, we structure the stop routing problem as a TSP whose objective function combines the travel time with zone transition penalties derived from the zone order obtained in the previous stage. As the objective function now takes the form of a weighted multi-objective function, we again have the opportunity to apply structured prediction to fine-tune the initial penalty values.
Our computational results show that, compared to the conventional method of stop routing by a TSP over travel times, our two-stage zoning approach performs significantly better. Stage 1 experiments confirm that combining distances and transition probabilities results in better zone sequences. Stage 2 experiments show that our proposed zone penalty framework results in higher-quality stop sequences. In both sets of experiments, we confirm that we can obtain improved results by using structured output prediction.
Future work will involve applications to other extended settings, such as vehicle routing. Another relevant future research question is to measure the impact on accuracy and computation time of using different methods to solve the TSP, either exactly or heuristically, as well as incremental methods that do not require re-solving from scratch in every structured-output prediction loop. Another avenue that can be explored is the use of additional features for learning, both at the zone level and the stop level, as well as the use of other structured output prediction methods, including deep learning ones. However, more complex learning architectures may also require more training data.
References
 The traveling salesman problem. Princeton University Press.
 Predicting Structured Data. The MIT Press.
 Vehicle routing by learning from historical solutions. In International Conference on Principles and Practice of Constraint Programming, pp. 54–70.
 Routing service quality – local driver behavior versus routing services. In 2013 IEEE 14th International Conference on Mobile Data Management, Vol. 1, pp. 97–106.
 An inverse optimization approach for a capacitated vehicle routing problem. European Journal of Operational Research.
 RNA secondary structure prediction by learning unrolled algorithms. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020.
 Driving semantic parsing from the world's response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL 2010, Uppsala, Sweden, July 15–16, 2010, M. Lapata and A. Sarkar (Eds.), pp. 18–27.
 Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pp. 1–8.
 The truck dispatching problem. Management Science 6 (1), pp. 80–91.
 Navigation made personal: inferring driving preferences from GPS traces. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 1–9.
 Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142.
 Attention, learn to solve routing problems! In International Conference on Learning Representations.
 The traveling salesman problem: an overview of exact and approximate algorithms. European Journal of Operational Research 59 (2), pp. 231–247.
 Trip router with individualized preferences (TRIP): incorporating personalization into route planning. In AAAI, pp. 1795–1800.
 Learning from route plan deviation in last-mile delivery. Master's Thesis, Massachusetts Institute of Technology, Cambridge.
 Recommender system application developments: a survey. Decision Support Systems 74, pp. 12–32.
 Data driven VRP: a neural network model to learn hidden preferences for VRP. In 27th International Conference on Principles and Practice of Constraint Programming (CP 2021), pp. 42.
 Integer programming formulation of traveling salesman problems. Journal of the ACM (JACM) 7 (4), pp. 326–329.
 Decision-making process and factors affecting truck routing. In Freight Transport Modelling.
 Pointer networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28.
 Discriminative reranking of diverse segmentations. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23–28, 2013, pp. 1923–1930.
 Last-mile delivery made practical: an efficient route planning framework with theoretical guarantees. Proceedings of the VLDB Endowment 13 (3), pp. 320–333.
Acknowledgments
This research received partial funding from the FWO Flanders project grant FWO-S007318N (Data-driven logistics), the European Research Council (ERC H2020, Grant agreement No. 101002802, CHAT-Opt), and the Institute for the encouragement of Scientific Research & Innovation of Brussels (Innoviris, 2021-RECONCILE).