Service failures are pervasive in supply-chain networks, with important consequences for cost-efficiency and customer experience. We aim at predicting and explaining the causes of such failures, focusing on the last-mile pickup and delivery of items at customer locations. Such services are planned by optimizers solving variations of the Vehicle Routing Problem, in our case the Pickup and Delivery Problem with Time Windows (PDPTW [ropke2009branch]). Solutions consist of routes, possibly associated with a specific truck and driver, defined as sequences of stops at customer locations where services are delivered, subject to time-window, capacity and precedence constraints. Routes start and end at dispatch centers, where trucks are loaded.
We aim at predicting service failures that occur after the route was planned, such as customer not at home, service rescheduled by the dispatch center, and service refused or canceled by the customer. Such predictions could be leveraged by optimizers to find routes with fewer failures, for instance by adjusting the slack time to maximize the probability that customers will be at home at the time of the service. They could also be used directly by human planners, for instance to increase the number of confirmation calls to customers in case of high failure probability.
Current approaches to solve the pickup and delivery problem with time windows include heuristics [rop06], meta-heuristics [nanry2000solving] and exact methods [ropke2009branch]. To the best of our knowledge, they do not take service failures into account. Instead, the solutions suggested in the literature to reduce service failure in last-mile problems include Collection Delivery Points [song2009addressing], reception boxes, and delivery boxes [punakivi2001solving]. While these methods greatly improve service efficiency, they also reduce customer satisfaction by not delivering items directly to customers' homes. Instead, we aim at (1) predicting failure probabilities, (2) identifying factors that predict failures, and (3) suggesting counter-measures to avoid failures.
We break down the problem of failure prediction into multiple, independent supervised classification problems. Given a failure type, our goal is to predict if a stop will fail with this type. We are also looking for association between stop features and failure, to provide specific ranges of values that lead to increased or reduced failure rates.
We build our classifier using Random Forests [breiman2001random], as they are among the most successful and frequently used methods for supervised classification. Random Forests belong to the family of ensemble learning methods: they combine the predictions of multiple decision trees built from randomly-selected features, to reduce overfitting. Random Forests also provide measures of feature importance, which help interpret classification results. To provide further insight into the failure causes, we extract Association Rules between categorized feature values and specific failure types.
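This setup can be sketched with scikit-learn; the data below is synthetic, standing in for the stop features (the actual dataset is proprietary), and the imbalanced class weights mimic the failure rate:

```python
# Sketch: a Random Forest classifier with feature importances,
# trained on synthetic, imbalanced data (class 1 = "failed stop").
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
clf.fit(X, y)

# Feature importance: mean impurity decrease, averaged across the trees.
for i, imp in enumerate(clf.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```

In scikit-learn the reported importances are normalized to sum to 1 over all features.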
We analyze a large dataset of more than 500,000 pickup and delivery services that occurred in Canada over a period of 6 months, a few percent of which failed. The dataset is strongly imbalanced toward successful services, as is commonly the case in failure analysis. This is an issue for classifiers, as they tend to focus on the majority class, leading to poor sensitivity. To address this issue, we leverage resampling methods to either undersample the majority class [mani2003knn] or oversample the minority class [chawla2002smote].
This paper makes the following contributions:
We apply Random Forests to the prediction of stop failures in the pickup and delivery problem with time windows and precedence constraints.
We apply Association Rules to the identification of factors predicting failures types.
We analyze a dataset of 500,000 pickup and delivery services, on which we quantify the performance of the classifier and highlight feature ranges leading to increased failure probabilities, for 5 failure types.
II Dataset and methods
We extracted a dataset from the database of ClearDestination Inc, a Montreal-based company providing services for supply-chain optimization. The dataset contains 523,643 pickup and delivery services grouped in 183,872 stops, scheduled between September 2017 and February 2018 in Canada. Table I shows the features describing the stops and their services. Features describe the customer location, position of the stop in the route produced by the optimizer, phone call status (phone calls are placed to the customer at specific times before the service), date, and service characteristics. Some features may be correlated: for instance, features describing geographical location are strongly interdependent.
II-A2 Data representation
The dataset contains records where each stop is associated with one or more services, represented as a vector of variable size (3 services per stop on average). This is a problem since classifiers cannot operate on spaces containing vectors of variable size. A solution could be to create one record per service, replicating the stop features in all the services associated with the stop. However, such replication would likely lead the classifier to overfit particular stops, and to rely, for instance, on the exact latitude and longitude of a stop to predict failures. This is not desirable in our case, since we aim at predicting failures in situations more general than particular stop locations that may never re-occur.
Instead, we aggregated the services of a particular stop in a “master service”, for which the value of categorical service features (Service_Type, Retailer and Item_Manufacturer) was determined as the most frequent value among services in the stop, and the value of numerical features (Item_Volume, Item_Weight and Estimated_Service_Time) was the sum of the values among services in the stop. The status and failure type of the stop were set as the most frequent status and failure type among its services. The resulting aggregated dataset contains 183,872 stops with 6.68% of failures. Figure 1 shows the failure distribution by failure type in the aggregated dataset.
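The aggregation can be sketched with pandas as follows; the column names follow Table I, but the values and the Stop_ID column are illustrative, not the actual schema:

```python
# Sketch of the "master service" aggregation: categorical features take the
# most frequent value among a stop's services, numerical features are summed.
import pandas as pd

services = pd.DataFrame({
    "Stop_ID":      [1, 1, 1, 2],
    "Service_Type": ["delivery", "delivery", "pickup", "pickup"],
    "Item_Volume":  [0.5, 1.0, 0.2, 3.0],
    "Item_Weight":  [10, 20, 5, 40],
    "Status":       ["OK", "FAIL_NAH", "FAIL_NAH", "OK"],
})

most_frequent = lambda s: s.mode().iloc[0]
stops = services.groupby("Stop_ID").agg(
    Service_Type=("Service_Type", most_frequent),  # categorical: mode
    Item_Volume=("Item_Volume", "sum"),            # numerical: sum
    Item_Weight=("Item_Weight", "sum"),
    Status=("Status", most_frequent),              # stop status: mode
)
print(stops)
```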
II-A3 Failure types
We focus on some of the most frequent failures, namely:
Customer not at home (NAH, frequency: 1.46%)
Stop rescheduled by dispatch center (SR, 0.80%)
Refused by customer (RC, 0.60%)
Canceled by customer (CC, 0.49%)
Not in Stock (NS, 0.34%)
Failure type Customer not at home (NAH) happens when the customer is not present at the time of the service. Failure type Stop rescheduled by dispatch center (SR) may happen due to any unexpected event in the supply chain, for instance construction in the delivery area, or inbound delays at the dispatch center. Failure type Refused by customer (RC) means that the item was delivered to the customer’s place, but the customer refused it, perhaps because it did not match their expectations. Failure type Canceled by customer (CC) means that the service was canceled at the customer’s location, for instance because the customer did not have cash. Finally, failure type Not in stock (NS) means that the item was not in stock at the dispatch center on the day of delivery, which may happen when the information is unknown at the time of the planning.
The failure types present in the dataset may not always be assigned accurately. They are set by different actors in the supply chain, including the planning company, the dispatcher and the drivers, who may interpret the failure types differently. In particular, some failure types may overlap, which may disturb the classification.
We pre-processed the dataset to remove duplicate entries, to replace missing values with a constant (-100), and to encode categorical features as numerical values. Only 3 features had missing values: Item_Manufacturer (S5, missing rate of 90.8%), Apartment_number (C5, 82.7%), and Door_number (C3, 8.6%). In the remainder, results involving Item_Manufacturer and to some extent Door_number should be interpreted carefully due to the high missing rate. Missing apartment numbers, however, make sense, as many addresses simply do not include an apartment number.
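These pre-processing steps can be sketched with pandas; the records below are invented, and the column names follow Table I:

```python
# Sketch of the pre-processing: drop duplicates, fill missing values with a
# constant, and encode categorical features as integer codes.
import pandas as pd

df = pd.DataFrame({
    "Item_Manufacturer": ["acme", None, "acme", "acme"],
    "Door_number":       [12, 12, None, 7],
}).drop_duplicates()

df["Door_number"] = df["Door_number"].fillna(-100)         # numerical constant
df["Item_Manufacturer"] = df["Item_Manufacturer"].fillna("missing")
df["Item_Manufacturer"] = pd.factorize(df["Item_Manufacturer"])[0]
print(df)
```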
II-A5 Dataset imbalance
We used the following 3 methods to deal with dataset imbalance. First, we oversampled the minority class using the Synthetic Minority Oversampling Technique (SMOTE [chawla2002smote]). SMOTE generates synthetic entries in the minority class, here the class of failed stops, using random linear combinations of existing failed stops. We applied SMOTE-regular using 2 nearest neighbors, and an oversampling ratio equal to the ratio between the number of elements in the majority class and the number of elements in the minority class (so that the resulting dataset is evenly distributed between both classes).
We also evaluated NearMiss, a method that undersamples the majority class using k-nearest neighbors [mani2003knn]. We used NearMiss-3, which, for every element in the minority class, determines its nearest neighbors and keeps only the furthest ones. We applied this method with 3 nearest neighbors.
Finally, we assessed a straightforward random undersampling of the majority class, as we suspected that SMOTE and NearMiss would be disturbed by their use of the Euclidean distance to determine nearest neighbors, which is questionable in our dataset.
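Random undersampling itself is simple enough to sketch with the standard library; the data below is a toy stand-in, with label 1 marking a failed stop:

```python
# Sketch of random undersampling of the majority (successful) class: keep all
# failures and an equal-sized random sample of successes.
import random

def random_undersample(records, labels, seed=0):
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    kept = minority + rng.sample(majority, len(minority))
    return [records[i] for i in kept], [labels[i] for i in kept]

records = list(range(100))
labels = [1] * 7 + [0] * 93       # ~7% failures, as in the aggregated dataset
X_bal, y_bal = random_undersample(records, labels)
print(sum(y_bal), len(y_bal))     # 7 failures out of 14 kept records
```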
II-B Classification using Random Forest
We trained Random Forest classifiers on the aggregated dataset, using a training ratio of 80% and 5-fold cross validation. We used Random Forests as implemented in scikit-learn version 0.19.1.
To find suitable values of the Random Forest parameters, we performed a grid search on a training dataset with different numbers of estimators, maximum depths of a tree, split criteria and minimum numbers of samples to split on a node. Table II shows the selected parameter values.
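The grid search can be sketched with scikit-learn's GridSearchCV; the grid below is a reduced subset of the values in Table II, and the data is synthetic:

```python
# Sketch of the parameter grid search with 5-fold cross validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
grid = {
    "n_estimators": [10, 50],        # subset of 10, 50, 100, 200, 500
    "max_depth": [5, 6, 7],          # subset of 5, 6, 7, 8, 9, 10, 20, 50
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```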
| Parameter | Description | Selected value | Grid search |
|---|---|---|---|
| # Estimators | Number of trees in the random forest | 100 | 10, 50, 100, 200, 500 |
| Max depth | Maximum depth of the trees | 6 | 5, 6, 7, 8, 9, 10, 20, 50 |
| Split criterion | Impurity criterion used in tree nodes | Gini | Gini, Entropy |
| Max features | Number of features to consider in each split | – | – |
| Min samples to split | Minimum number of samples required to split a node | 10 | – |
| Min samples in leaf | Minimum number of samples required in a leaf node | 5 | – |
| OOB score | Out-of-bag error | True | – |
We also captured feature importance in the Random Forest classification, as reported by scikit-learn. It is defined by the scikit-learn developers (https://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined) as the total decrease in node impurity, weighted by the probability of reaching that node (approximated by the proportion of samples reaching it), averaged over all trees of the ensemble. However, feature importance provides limited insights to interpret classification results. One reason is that correlation within a group of features reduces the mean importance of that group, as discussed in [genuer2010variable]. To further interpret the results, we also extracted Association Rules, as explained hereafter.
II-C Association Rules
We extracted Association Rules from the aggregated dataset, to check the consistency of classification results and to provide further insight into their interpretation. To do so, we categorized numerical features into deciles, and we represented stops with vectors containing (1) such categorized features, (2) the initial categorical features, and (3) a binary feature representing the stop status (success or failure type). An Association Rule, written as "antecedent ⇒ consequent", consists of two tuples, an antecedent and a consequent [agrawal1994fast]. We focus on the rules where the consequent is a singleton containing a stop status. To represent features in the antecedent, we postfix numerical features with _Dx, to indicate that the value is in the x-th decile, and categorical features with _Vx, to indicate that the value is x. A hypothetical example of rule is:

Start_Slack_D3, Day_V4 ⇒ FAIL_NAH

which measures the association between failure type "Customer not at home" and stops where Start_Slack is in the third decile and Day has value 4. It should be noted that Association Rules provide a measure of co-occurrence rather than causality.
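The decile encoding of a numerical feature into _Dx items can be sketched with pandas; the Start_Slack values below are invented:

```python
# Sketch: cut a numerical feature into deciles and build _Dx / _Vx items.
import pandas as pd

start_slack = pd.Series([3, 7, 11, 16, 22, 29, 35, 44, 52, 61,
                         68, 74, 80, 87, 95, 103, 112, 120, 131, 145])
deciles = pd.qcut(start_slack, 10, labels=False) + 1   # 1 = lowest decile
items = [f"Start_Slack_D{d}" for d in deciles]
items.append("Day_V4")                                 # categorical: Day = 4
print(items[0], items[-2], items[-1])
```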
To measure the relevance of a rule x ⇒ f, we define its interest ratio (IR) by comparing the frequency of tuple x in the complete dataset D with its frequency in the set F_f of failed stops (the stops that contain f):

freq(x, S) = supp(x, S) / |S|

where supp(x, S), the support of tuple x in set S, is the number of occurrences of x in S. We focus on the cases where supp(x, F_f) > 0, which gives supp(x, D) > 0. Then we define the interest ratio as follows:

IR(x ⇒ f) = freq(x, F_f) / freq(x, D)

The interest ratio measures the effect of x on the probability to fail with type f. Another way to understand it is to express its relation to the failure probability conditional on the presence of x:

IR(x ⇒ f) = P(f | x) / P(f)

We also compute the confidence of rule x ⇒ f with the usual definition:

conf(x ⇒ f) = supp((x, f), D) / supp(x, D) = P(f | x)

The following relation should finally be noted:

conf(x ⇒ f) = IR(x ⇒ f) × |F_f| / |D|
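These definitions translate directly into a few lines of Python; the toy dataset below is invented, with the FAIL item standing for one failure type:

```python
# Sketch of support, interest ratio and confidence for a rule x => FAIL.
def supp(x, S):
    """Number of records of S containing every item of tuple x."""
    return sum(all(i in rec for i in x) for rec in S)

D = [{"A", "FAIL"}, {"A"}, {"A"}, {"B"},
     {"B", "FAIL"}, {"B"}, {"C"}, {"C"}]          # full dataset
F = [rec for rec in D if "FAIL" in rec]           # failed stops

x = ("A",)
IR = (supp(x, F) / len(F)) / (supp(x, D) / len(D))
conf = supp(x + ("FAIL",), D) / supp(x, D)
print(IR, conf)        # conf equals IR * len(F) / len(D)
```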
We are looking for rules with a high interest ratio (IR much larger or much smaller than 1), and a high frequency in the failed set (supp(x, F_f) ≥ s, where s is the desired support threshold in F_f). We find them using the following approach:

1. Find the set X of itemsets x in F_f s.t. supp(x, F_f) ≥ s.

2. For every x in X, compute supp(x, D).
We perform step 1 using the FP-growth algorithm [han2000mining], as implemented in Apache Spark version 2.3.1. Note that finding the frequent itemsets in F_f requires much less memory than in D, since |F_f| ≪ |D|. We implemented step 2 using a single pass on D, which does not raise any memory issue since X is small. To obtain a limited set of rules, we then select the Association Rules based on their interest ratio and size, where the size n(r) of a rule r is the number of elements in its antecedent x, R_i is the set of rules of size i, and ≺ is a partial order on rules.
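The two-step extraction can be sketched as follows; a brute-force enumeration stands in for FP-growth, and the itemsets are toy data:

```python
# Sketch of the two-step extraction: mine frequent itemsets in the (small)
# failed set only, then a single pass over the full dataset for supp(x, D).
from itertools import combinations
from collections import Counter

F = [{"A", "B"}, {"A", "B"}, {"A", "C"}]       # failed stops (toy data)
D = F + [{"A"}, {"B"}, {"C"}, {"A", "C"}]      # full dataset
s = 2                                          # support threshold in F

# Step 1: frequent itemsets of F (FP-growth in the paper; brute force here).
counts = Counter(frozenset(c) for rec in F
                 for k in (1, 2) for c in combinations(sorted(rec), k))
X = {x for x, n in counts.items() if n >= s}

# Step 2: single pass over D to count supp(x, D) for every frequent itemset.
supp_D = {x: sum(x <= rec for rec in D) for x in X}
print(sorted((tuple(sorted(x)), n) for x, n in supp_D.items()))
```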
III-A Classification results
Figure 2 shows the sensitivity (ratio of true positives) and specificity (1 - ratio of false positives) obtained for the different failure types and resampling methods. Without resampling, the sensitivity to failure remains 0, as expected in such an imbalanced dataset. Oversampling with SMOTE improves the sensitivity to an average of 0.36 while maintaining a high specificity of 0.92. Undersampling with NearMiss and Random Undersampling further increases the sensitivity to an average of about 0.7, with a specificity close to 0.7. This appears to be the best compromise between sensitivity and specificity. On average, Random Undersampling performs slightly better than NearMiss. In the remainder, we focus on results obtained with Random Undersampling.
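The two reported metrics can be computed directly from predictions; the label vectors below are invented, with label 1 marking a failed stop:

```python
# Sketch: sensitivity (true-positive rate on failed stops) and
# specificity (true-negative rate on successful stops).
def sensitivity_specificity(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos = sum(t == 1 for t in y_true)
    neg = sum(t == 0 for t in y_true)
    return tp / pos, tn / neg

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(sensitivity_specificity(y_true, y_pred))
```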
The classification performance is quite stable across failure types. With Random Undersampling, the best specificity values are obtained for NAH and NS, while RC is slightly under average. Sensitivity values are close to average for all failure types, NS being slightly above.
III-B Important features and Association Rules
III-B1 Customer not at home (NAH)
Figure 2(a) shows the feature importance resulting from the classification of failures of type “Customer not at home”. Feature labels refer to the ones in Table I, and feature importance is computed as explained in Section II-B. Feature importance is largely dominated by a single feature, Detailed_Call_Status (P3), peaking at an importance of 0.27. The next 18 features have similar importance values, ranging from 0.026 to 0.05. The remaining 13 features are between 0.00 and 0.022.
Figure 2(b) shows the antecedents of the Association Rules with consequent FAIL_NAH, selected as described in Section II-C, with their confidence and interest ratio (IR). For clarity, the elements of rules of size 1 are omitted in rules of size 2. For instance, Rule 1 means that (Detailed_Call_Status_V3 ⇒ FAIL_NAH) has a confidence of 0.036 and an interest ratio of 2.45, and Rule 2 means that (Detailed_Call_Status_V3, Estimated_Service_Time_D2 ⇒ FAIL_NAH) has a confidence of 0.06 and an interest ratio of 4.07. Rules with IR > 1 are represented in red, and rules with IR < 1 in green. For instance, Rule 1, shown in red, means that Detailed_Call_Status=3 increases failure probability by a factor of 2.45, while Rule 13, shown in green, means that Detailed_Call_Status=2 decreases failure probability by a factor of 2.06.
The rules in Figure 2(b) are consistent with the feature importance in Figure 2(a), Detailed_Call_Status (P3) being the most important feature. They show that P3=3, which means that a call landed on voicemail, increases the failure probability by 2.45 times (Rule 1). This ratio increases to 3.67 if the call was marked failed (Rule 3: P2=5), to 4.07 if the estimated service time is shorter than 8 minutes (Rule 2: S6 in D2 (240-480]), or to 3.12 if the item volume is low (Rule 5: S3 in D1 [0.0, 0.002]), perhaps because less voluminous items are cheaper on average and customers give less value to them. The failure probability also increases if the item is delivered by specific companies (Rule 4: S2=3 and Rule 8: S2=39), in the Montreal/Laval area (Rule 6: C9=22), on a Tuesday, Wednesday or Thursday (Rule 9: D3=2, Rule 11: D3=3, Rule 7: D3=4), if the time window starts between 6am and 8am (Rule 10: R7 in D1 (359.99, 480.0]), or if the service is planned between 10am and 12pm (Rule 12: D2=2).
Conversely, P3=2, which means that a call was answered by a human, reduces the failure probability by 2.06 times (Rule 13). Failure probability is further reduced if the item is lighter than 2 lbs (Rule 14: S4 in D1 [0.0, 2.0]), perhaps because drivers can leave small items unattended at the customer’s door when they agreed during a phone call. Finally, failure probability is also reduced if the address has no apartment number (Rule 15: C5 in D1) or if the time window provided to the customer is short (Rule 16: R9 in D2 (120.0, 180.0]).
III-B2 Stop rescheduled (SR)
Figure 3(a) shows the feature importance resulting from the classification of failures of type "Stop rescheduled". The feature importance is more uniformly distributed than for NAH. Four features stand out: S2 (Retailer), D1 (Week of Year), R9 (Time_Window_Size) and R2 (Driver). Feature importance remains quite constant for the next 8 features, and it seems to decrease linearly to 0 for the remaining features.
The Association Rules in Figure 3(b) confirm the importance of the Retailer: some companies increase the failure rate (Rule 1: S2=39, Rule 47: S2=149), while others reduce it (Rule 13: S2=3). The failure rate is also increased by high start slack times (Rule 9: R10 in D9 (120.0, 129.0], Rule 40: R10 in D8 (115.0, 120.0]), and by high end slack times (Rule 14: R11 in D6 (116.0, 120.0], Rule 44: R11 in D7 (120.0, 128.0]). The time window size also has an effect on the failure rate: D3 (180.0, 240.0] increases the failure rate (Rules 17, 46, 49, 54 and 58), while D2 (120.0, 180.0] reduces it (Rule 29). As for NAH, a call landing on voicemail (P3=3) increases the failure probability (Rule 34). Interestingly, services executed toward the end of the route tend to be rescheduled more often (Rule 55: R3 in D10 (16.0, 36.0]), and so do services with a short estimated job time (Rule 56: S6 in D1 (0.99, 240.0]). Finally, failures are also more frequent for services with a median weight (Rule 32: S4 in D5 (382.0, 47,016.0]) or for volumes lower than 18.3 cf (Rule 42: S3 in D3 (1.45, 18.3], Rule 52: S3 in D2 (0.002, 1.45]).
III-B3 Refused by customer (RC)
Figure 4(a) shows the feature importance resulting from the classification of failures of type "Refused by customer". Feature S2, Retailer, stands out again. The importance seems to decrease linearly for the next features, with a slight increase for C2 (Latitude) and S4 (Item_Weight), and a slight drop between R3 and P3.
The Association Rules in Figure 4(b) show the effect of R9 (Time_Window_Size): when in D3 (180.0, 240.0] (Rule 1), it reduces the failure rate by a factor of 1.78, while when in D4 (240.0, 300.0] (Rule 5), it increases it by a factor of 1.74. The failure rate also increases for company 8 (Rule 7: S2=8), for an estimated service time in D3 (480.0, 720.0] (Rule 11), in the Toronto area (Rule 16: C1 in D4 (-79.469, -79.254], Rule 18: C2 in D3 (43.595, 43.737], Rule 30: C9=23), for the highest start slack times (Rule 19: R10 in D10 (129.0, 751.0]), for services scheduled between 11:39am and 12:08pm (Rule 24: R7 in D6 (700.0, 728.0]), and for voluminous items (Rule 28: S3 in D6 (49.02, 59.6]).
III-B4 Canceled by customer (CC)
Figure 5(a) shows the feature importance resulting from the classification of failures of type "Canceled by customer". As in the two previous failure types, Retailer (S2) stands out, and the importance seems to decrease linearly among the other features.
The Association Rules in Figure 5(b) show the importance of Time_Window_Size (R9), as in the previous failure type: services tend to fail less when R9 is in D3 (180.0, 240.0] (Rule 1), and they fail more when R9 is in D2 (120.0, 180.0] (Rule 8, 12, 24, 36, 50 and 55) or in D4 (240.0, 300.0] (Rule 19). Two particular companies also have increased failure rates (Rule 13: S2=8, Rule 25: S2=3). The failure rate is also increased in the Montreal area (Rule 5: C6=2118, Rule 5: C2 in D7 (45.452, 45.52], Rule 72: C2 in D8 (45.52, 45.619], Rule 10: C1 in D8 (-73.586, -73.289], Rule 22: C1 in D7 (-73.754, -73.586], Rule 33: C9=22), when a call lands on voicemail (Rule 45: P3=3), when the service is toward the end of the route (Rule 64: R3 in D10 (16.0, 36.0]), when the service time window starts around mid-day (Rule 69: R7 in D6 (700.0, 728.0]) or ends between 4pm and 5pm (Rule 70: R8 in D9 (960.0, 1020.0]), and when the item has a very low or close-to-average volume (Rule 51: S3 in D1 [0.0, 0.002], Rule 59: S3 in D6 (49.02, 59.6]).
III-B5 Not in stock (NS)
Figure 6(a) shows the feature importance resulting from the classification of failures of type “Not in stock”. The most important feature is the week of the year (D1), followed by features related to geographical location (C2, C1, C8 and C9), features related to the route (R9, R11 and R10), the company (S2) and the volume (S3).
This is consistent with the Association Rules in Figure 6(b). Note that we used different filtering parameters for this failure type, due to the large number of rules with high IR. Rule 2 has an extremely high IR of 110.43, for a confidence of 37.4%: it means that company 158 had a not-in-stock failure rate of 37.4% in province 1 (Québec). Some geographical locations spanning from Gatineau to Sorel-Tracy have increased failure rates (Rules 8, 42 and 51: C1 in D7 (-73.754, -73.586], Rules 12 and 55: C1 in D6 (-75.601, -73.754], Rule 9: C2 in D8 (45.52, 45.619], Rule 14: C2 in D9 (45.619, 46.328], Rule 60: C2 in D7 (45.452, 45.52]), while others have decreased failure rates (Ontario, Rule 49: C7=2). Two specific weeks have increased failure rates: week 36 (Rule 19), which was the week of Labor Day in 2017, and week 44 (Rule 50), which was the week of Halloween. As for route features, time windows shorter than 2 hours (Rule 5: R9 in D1 [0.0, 120.0]), negative end slack times (Rule 32: R11 in D1 (-537.001, 62.0]), and start slack times between 53 and 63 minutes (Rule 21: R10 in D3 (53.0, 63.0]) have increased failure rates, while time windows between 2 and 3 hours (Rule 72: R9 in D3 (180.0, 240.0]) reduce the failure rate. Specific companies increase the failure rate (Rule 1: S2=158, Rule 41: S2=7) while others reduce it (Rule 71: S2=3).
IV-A Classification results
Random Forests showed good performance when applied to the dataset pre-processed with Random Undersampling: they reach an average sensitivity of 0.7 and an average specificity of 0.7. Thus, 70% of the failures of the studied types could be predicted, which represents 4,750 failed stops per year in the studied dataset. This prediction ability is an opportunity to save on pickup and delivery costs.
The classification performance could be further improved by (1) improving the quality of the dataset, in particular through a better definition and separation of failure types, (2) improving the dataset aggregation technique to deal with records of non-uniform sizes; our technique essentially averages the features of the services in a stop, which leads to information loss, and (3) improving the strategy to deal with dataset imbalance, perhaps through a more specific oversampling method. Regarding point (3), the poor performance of SMOTE compared to the other resampling methods is illustrated in Figure 8. The linear combinations of services generated by SMOTE are not realistic. In particular, the generated services do not respect natural boundaries such as lakes or uninhabited regions, not to mention roads or actual addresses. This behavior is not surprising, since no such constraints were included in the oversampling method. Similar inconsistencies are also very likely to occur in other features. In contrast, NearMiss and Random Undersampling maintain a realistic distribution of services, at the cost of reducing the dataset. A more constrained oversampling technique might be able to address this limitation.
IV-B Important features and Association Rules
Overall, we observed good agreement between the feature importance obtained from Random Forests and the selected Association Rules. Nevertheless, most Association Rules have a low confidence value, below 5%, which shows that failures are predicted from combinations of features rather than from straightforward associations. We conclude that Association Rules, computed and selected using the methods we presented, are a relevant addition to RF feature importance to provide finer-grained interpretation.
It should be noted that our extraction of Association Rules focused on rules whose antecedent occurs at least s times among the failed stops. This explains why the selection was biased towards rules with IR > 1 (rules displayed in red).
IV-C Suggested counter-measures
The Retailer has a measurable effect on all the failure types. Specific investigations among the companies with failure rates higher than average should be conducted, to better understand the failure causes.
Failures of type "Customer not at home" (NAH) are very dependent on confirmation calls. In case such calls are not answered, additional ones should be scheduled, in particular if the estimated service time is short, if the item volume is low, if the item is not delivered on a Monday or Friday, if the time window starts before 8am, or if the service is planned between 10am and 12pm. In addition to these indicators, our trained Random Forest model could be used to recommend additional calls specifically for services predicted to fail. There might even be situations where multiple unanswered calls should result in the service being removed from the route, if the specificity could be made close enough to 1.
Failures of type "Stop rescheduled" (SR) are associated with many features related to the route (R3, R9, R10, R11) and a few others related to the type of service (S3, S4, S6). Such information could be included in optimizers, to facilitate the building of routes with fewer failures of this type. Start slack times longer than 2 hours lead to increased failure rates, which suggests that failures might happen due to delays in the route: dispatch centers might decide to skip stops when the service will not happen in the time window, which occurs with higher probability when the start slack time is high. Likewise, services scheduled toward the end of the route are rescheduled more often than average, perhaps again due to delays in the route. The Time Window Size also has an effect on the failure rate: increased failure rates are observed for window sizes longer than 3 hours.
Failures of type "Refused by customer" (RC) are also associated with route-related features (R7, R9 and R10), perhaps because delays lead impatient customers to refuse items. In addition, they seem to occur more frequently at specific geographical locations (Toronto area). Again, this information could be used by optimizers to build routes with fewer such failures. Such zones might also be further investigated to understand the reasons for refused items. In addition, specific items (volume in D6 and estimated service time in D3) have an increased failure rate of this type, which might be reported to the manufacturers.
Failures of type “Canceled by customer” (CC) are also associated with route-related and geographical features (Montreal area), which could again be used by optimizers.
Finally, failures of type “Not in Stock” (NS) are strongly related to one specific retailer for which more than 10% of the services fail, and even 37% in Québec. This should be reported to the retailer and further investigated.
This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).