Can Machine Learning Help in Solving Cargo Capacity Management Booking Control Problems?

Revenue management is important for carriers (e.g., airlines and railroads). In this paper, we focus on cargo capacity management which has received less attention in the literature than its passenger counterpart. More precisely, we focus on the problem of controlling booking accept/reject decisions: Given a limited capacity, accept a booking request or reject it to reserve capacity for future bookings with potentially higher revenue. We formulate the problem as a finite-horizon stochastic dynamic program. The cost of fulfilling the accepted bookings, incurred at the end of the horizon, depends on the packing and routing of the cargo. This is a computationally challenging aspect as the latter are solutions to an operational decision-making problem, in our application a vehicle routing problem (VRP). Seeking a balance between online and offline computation, we propose to train a predictor of the solution costs to the VRPs using supervised learning. In turn, we use the predictions online in approximate dynamic programming and reinforcement learning algorithms to solve the booking control problem. We compare the results to an existing approach in the literature and show that we are able to obtain control policies that provide increased profit at a reduced evaluation time. This is achieved thanks to accurate approximation of the operational costs and negligible computing time in comparison to solving the VRPs.



There are no comments yet.


page 1

page 2

page 3

page 4


Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information

This paper offers a methodological contribution at the intersection of m...

AI-CARGO: A Data-Driven Air-Cargo Revenue Management System

We propose AI-CARGO, a revenue management system for air-cargo that comb...

Machine Learning with Operational Costs

This work proposes a way to align statistical modeling with decision mak...

A Comparative Study of Single-Constraint Routing in Wireless Mesh Networks Using Different Dynamic Programming Algorithms

Finding the shortest route in wireless mesh networks is an important asp...

Inventory Routing Problem with Facility Location

We study problems that integrate depot location decisions along with the...

Network Revenue Management with Limited Switches: Known and Unknown Demand Distributions

This work is motivated by a practical concern from our retail partner. W...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Revenue management (RM) is of importance in many commercial applications such as airline cargo, hotels, and attended home delivery (see, e.g., the survey [10]). In general, RM focuses on the decisions regarding the distribution of products or services with the goal of maximizing the total profit or revenue. The focus of this work is on quantity decisions in the context of a booking control problem, where a set of requests are observed across a time horizon. Each request can either be accepted or rejected. At the end of the booking period the set of accepted requests needs to be fulfilled, this corresponds to an operational decision-making problem

. The booking control problem can be formulated as a Markov Decision Process (MDP), which describes the relationship between the revenue from accepting a request, the decrease in capacity for future requests, and the cost associated with the fulfillment of the accepted requests. Although the MDP captures the problem structure, solving it is often intractable due to the curse of dimensionality. As such, approximate methods have been widely adopted for a range of RM problems. Bid-price and booking-limit control policies as described in

[22] are among the most used methods for capacity control problems.

Outside of the context of RM, reinforcement learning (RL), has seen success in a variety of challenging control problems, such as Atari games [16] and Go [20]. For a detailed overview of RL we refer the reader to [4, 21]. Despite the success of RL in applications with intractable state spaces, the applications within RM have mainly been limited to airline seat allocation problems [5, 8, 12]

. A major limitation for the direct application of RL to capacity control problems is in the time required for simulating the system. Indeed, the computational cost associated with solving the end-of-horizon operational decision-making problem is non-negligible, which leads to a prohibitively expensive computational cost for a simulation-based approach. This limitation can be observed in applications such as the vector packing in airline cargo management

[3, 13, 14] or the vehicle routing (VRP) in distribution logistics [7]. To overcome this, we propose an approximation to the operational cost via supervised learning. We then leverage the resulting prediction function within approximate dynamic programming (DP) and RL algorithms and evaluate our approach on an application in distribution logistics.

In this paper, we make the following contributions: (1) We propose a methodology that (i) formulates an approximate MDP by replacing the formulation of the operational decision-making problem with a prediction function defined by offline supervised learning and (ii) uses approximate DP and RL techniques to obtain approximate control policies. (2) We apply the proposed methodology to a distribution logistics problem from the literature [7] and we show computationally that our policies provide increased profit at a reduced evaluation time.

The remainder of the paper is organized as follows. In Section 2, we introduce our methodology. In Section 3, we introduce an application to distribution logistics as well as the considerations made in order to formulate the supervised learning task. In Section 4, we provide results of various control policies and compare to baselines. Finally, Section 5 concludes.

2 Methodology

In this section, we first introduce a general MDP formulation of our booking control problem by following closely the notation in the literature, e.g., [22]. Second, we describe a formulation based on an approximation of the operational costs.

2.1 Booking Control Problem Formulation

Let denote a set of requests with cardinality . The decision-making problem is defined over periods indexed by

. The probability that request

is made at time is given by and the probability that no request occurs at period is given by . The time intervals are defined to be small enough such that at most one request is received in each period. The revenue associated with accepting a request is .

To formulate this as an MDP, we let denote the number of requests accepted before time and , denote the state vector where the -th index is given by . The decision variables are denoted by , where if an offer for request is accepted at time . The deterministic transition is given by , with with a -th component equal to 1 and every other equal to 0.

An operational decision-making problem occurs at the end of the booking period and is related to the fulfilment of the accepted requests. This problem only depends on the state at the end of the time horizon, . We denote operational cost by .

With the above definitions, we now define the dynamic program that represents the maximal expected profit from period on-ward. The value function is denoted by and is given by


with .

2.2 Approximation

Solving the MDP (1)–(2) can be intractable even for small instances. For this reason, approximate DP or RL appears to be a natural choice. However, these algorithms typically rely on some form of policy evaluation that consists in simulating trajectories of the system under a given policy. In our context, such policy evaluation is computationally costly due to (2). To overcome this, we propose the use of an approximation of . We first introduce a mapping from the state at the end of the time horizon to an dimensional representation with the function . Then, we define an approximation of as and an approximate MDP formulation is then given by


2.3 Supervised Learning

The approximation in (4

) can be defined in various ways, such as a problem specific heuristic, a mixed integer program (MIP) solved to a time limit or optimality gap or predicted by machine learning (ML). Here, we focus on ML in combination with a heuristic. Specifically, we propose to train a supervised learning model offline to separate the problem of accurately predicting

from solving (3)–(4).

To train a supervised learning model, we require a feature mapping from the state to the input to the model and a set of labeled states at the end of the time horizon. The feature mapping is dependant on the application, but in general is viewed as the function in (4). In Section 3 we describe the specific set of features we use for the application in distribution logistics.

To obtain labeled data we simulate trajectories in the system using a stationary random policy which accepts a request with given probability . At the end of the time horizon we obtain a state and compute . We then repeat this process for different values of . The idea is to have a representation of feasible final states in the data (optimal and sub-optimal ones). We denote the set of labeled data as .

3 Application in Distribution Logistics

We consider the distribution logistics booking control problem described in [7]. In this context booking requests correspond to pickup activities and the operational decision-making problem to a VRP. Each pickup request has an associated location and revenue. The cost incurred at the end of the booking period is the cost of the VRP solution and hence depends on all the accepted requests.

The problem can be formulated as (1)–(2). We now detail how we solve the VRP to obtain solutions that are comparable to those in [7]. We assume that there are a fixed number of vehicles, each with capacity . We also assume that the depot is at location . The set of all nodes is given by , i.e., the union of the location requests and the depot. The set of arcs is given by and denoted as . The cost of an arc is given by , for . The optimal objective value is denoted by . If more than vehicles are required, then we allow for additional outsourcing vehicles to be used at an additional fixed cost .

We choose large enough such that the cost for adding an additional vehicle is larger than any potential revenue from the requests it can fulfill. Finally, the operational cost is given by


The VRP formulation is provided in Appendix A.1.

Sets of Instances.

We generate 4 sets of instances with 4, 10, 15, and 50 locations. The locations were determined uniformly at random. The locations are split into groups, where the revenue in each group differs. The request probabilities are defined such that locations with a higher revenue have a greater probability of occurring later in the booking period. For a detailed description of the parameters that describe each set of instances, see Appendix A.2.


The features are computed from the state at the end of the time horizon,

. For the sake of simplicity, we consider a fixed-size input structure for the ML model. The number of locations can vary depending on the set of accepted requests. We therefore derive features from capacity, depot location, total number of accepted requests per location and aggregate statistics of the locations. For the latter, we use the distance between locations and the depot, and relative distances between locations. For each of these, we compute the min, max, mean, median, standard deviation, and 1st/3rd quartiles.

Prediction task.

We seek an accurate approximation of (5) that is fast to compute. For this purpose we use ML to predict

(in this work train random forest models

[6]) and we compute the outsourcing cost, , with a bin-packing solver (MTP, [15]).

Data for Supervised Learning.

We generate one data set for each of the four sets of instances (4, 10, 15 and 50 locations) using the algorithm described in Section 2.3, with . To compute a label for each instance in a reasonable time we use FILO [1], a heuristic solver for VRPs. To ensure the VRP solutions make use of a minimal number of vehicles, we offset the depot location. We note that the data generation is fast. It takes less than five minutes for the 4-locations data set and 40 minutes for the 50-locations data set.

4 Results for the Distribution Logistics Application

In this section, we start by presenting supervised learning performance metrics followed by the experimental setup and results on the booking control problem. Experiments were run on an Intel Core i7-10700 2.90GHz with 32GB RAM.

4.1 Supervised Learning Results

We partition each of the data sets into training/validation and test sets. It takes between 0.19 and 1.88 seconds to train the random forest models.

We assess the prediction performance using two performance metrics: mean squared error (MSE) and mean absolute error (MAE). The results reported in Table 1 show that we achieve relatively good performance but it deteriorates with the size of the instances. In particular, we note that the MSE (this metric penalizes large errors more severely than MAE) is quite large for the 50-location test data. Although labels are of larger magnitude when the number of locations increase so an increase in both MSE and MAE is expected, these results can potentially be improved by generating larger data sets and using more flexible ML models.

Locations     Training MSE Test MSE Training MAE Test MAE
4 1,000 250 1.82 5.30 0.70 1.24
10 2,000 500 3.12 13.80 1.26 2.79
15 2,000 500 1.45 11.17 0.90 2.53
50 2,000 500 23.22 139.41 3.68 9.24
Table 1: Supervised Learning Performance Metrics

4.2 Baselines

We benchmark our results with respect to three baseline policies. The booking limit policy (BLP) and booking limit policy with reoptimization (BLPR) as described in [7] and implemented using SCIP [2] as the MIP solver, without row generation. As a third baseline we use the stationary random policy (rand-) with acceptance probability giving the highest mean profit (we use the same values for as for the data generation).

It is possible to solve the smallest instances with exact DP. So in this case we report results for using the exact algorithm combined with solving each operational problem with FILO (DP-Exact) and with the predicted costs (DP-ML). For the problems with more than 10 locations our implementation in SCIP did not find incumbent solutions within a reasonable time, so the BLP and BLPR baselines are omitted for the 15 and 50 location instances.

4.3 Algorithms

We consider a standard set of RL and approximate DP algorithms to obtain approximate control policies: SARSA [19] as well as Monte-Carlo tree search (MCTS) with Upper Confidence Bounds Applied to Trees (UCT) [11]. These algorithms all rely on a set of simulated trajectories to evaluate the expected profit in (3). Specifically, we consider SARSA with neural state approximation. We define two different variants of the MCTS algorithm distinguished by their base policy: one uses a random policy (MCTS-rand-X) and the other SARSA (MCTS-SARSA-X). Here, “X” denotes the number of iterations each algorithm uses for simulation and in our experiments we use 30 and 100 simulations per state. Each of the above algorithms uses our approximation of (5) when computing a policy (i.e., the sum of the predicted operational cost and the outsourcing cost computed with MTP). However, we use FILO in the last step to compute the final operational cost.

4.4 Control Problem Results

To compare performance, we evaluate each method over the same 50 realizations of requests for each of the sets of instances. We start by analyzing solution quality followed by computing times.

In Figure 1, we provide box plots to show the distribution of the percentage gaps to the best known solution (i.e., the highest profit solution for each instance) for each algorithm. Figure 0(a) shows that, as expected, DP-Exact has the lowest gaps with a median at zero. The same algorithm but with approximate operational costs (DP-ML) results in gaps close to those of DP-Exact, demonstrating the effectiveness of predicting the operational cost. Comparing our approaches to the baselines BLP and BLPR in Figures 0(a) and 0(b), we can see that the policies obtained via any of the proposed control algorithms achieve smaller gaps than those of BLP, BLPR, and rand- baselines. Moreover, the BLP and BLPR policies report larger gaps than those of rand-

for 10 locations. For all sets of instances, the MCTS algorithms consistently achieve the smallest gaps, with some variations depending on the number of simulations and the base policy. As expected, larger number of simulations lead to smaller variance.

(a) 4 Locations
(b) 10 Locations
(c) 15 Locations
(d) 50 Locations
Figure 1: Box plots of optimality gaps to best known solutions

We now turn our attention to the analysis of computing times (reported in details in Appendix A.3). For this purpose, we distinguish between offline and online computing time and we focus the analysis on the latter. For DP-Exact, DP-ML, SARSA and MCTS-SARSA-X, we compute an exact or approximate value function offline. Similarly, the initial booking limit policy (BLP and BLRP) is computed offline, while any reoptimization of the booking limits and the solution of the VRP at the end of the time horizon contribute to the online computing time. For MCTS-rand-X and rand-, the policies are computed entirely online. The time associated with generating data and training the ML models adds to the offline computing time of the corresponding algorithms.

As expected, the online computing times are the shortest for SARSA and rand- with an average of less than 2 seconds for all sets of instances. On the 10-location instances, the bid price policies (BLP and BLPR) have an average online computing time comparable to MCTS-rand-30 and MCTS-SARSA-30 (10.93 and 12.63 seconds compared to 9.06 and 11.81 seconds, respectively). The computing time of the MCTS algorithms depend on the instance size. On average, they approximately double for 15-locations compared to 10-locations and the 50-location instances are on average 8 times more time consuming to solve than the 10-location ones in the case of MCTS-rand-X (11 times in the case of MCTS-SARSA-X). This leads to average online computing times between 1.2 (MTCS-rand-30) to 7.5 minutes (MCTS-SARSA-100) for the largest instances. We note that the latter is the algorithm achieving the highest quality solutions. However, using 100 simulations instead of 30 increases the online computing time by approximately a factor of 4 for all sets of instances.

The trade-off between solution quality and online computing time becomes clearly visible for the larger instances (15 and 50 locations). As highlighted in Figure 1, this trade-off can be partly controlled by the simulation budget. Noteworthy is the performance of SARSA on the largest instances: On average, the gap to the best known solution is 4.7% and the online computing time is only 1.31 seconds. In Appendix A.3, Table 2 reports the data generation and training times, and Table 3 reports mean profit, offline, and online computing times.

Finally, we comment on the number of evaluations of the operational costs. For the sake of illustration, we compare DP-Exact and DP-ML. The former calls FILO 10,000 times to compute the value functions (offline), while DP-ML only solves 1,000 VRPs when generating data for supervised learning. The gain is even more important in the context of SARSA that requires 25,000 cost evaluations during offline training. The MCTS-based algorithms require a number of cost evaluations (online) proportional to the number of simulations times .

5 Conclusion

In the context of complex RM problems where the quality of the booking policies hinges on accurate evaluation of operational costs, we have proposed a methodology that formulates an approximate MDP by replacing the formulation of the operational decision-making problem with a prediction function defined by offline supervised learning. We used approximate DP and RL to obtain approximate control policies. We have applied the methodology to the distribution logistics problem in [7] and we have shown computationally that our policies provide increased profit at a reduced evaluation time. This is because our algorithm strikes a balance between the online and offline computation. Accurate predictions from the ML model combined with a bin-packing heuristic were used to evaluate approximate operational costs online in computing times that are negligible in comparison to solving the VRPs.


We are grateful to Luca Accorsi and Daniele Vigo who provided us access to their VRP code FILO [1], which has been very useful for our research. We would also like to thank Francesca Guerriero for sharing details on the implementation of the approach in [7].


  • [1] Luca Accorsi and Daniele Vigo. A fast and scalable heuristic for the solution of large-scale capacitated vehicle routing problems. Technical report, University of Bologna, 2020.
  • [2] Tobias Achterberg. Scip: Solving constraint integer programs. Mathematical Programming Computation, 1:1–41, 2009.
  • [3] Christiane Barz and Daniel Gartner. Air cargo network revenue management. Transportation Science, 50(4):1206–1222, 2016.
  • [4] Dimitri P Bertsekas. Reinforcement learning and optimal control. Athena Scientific Belmont, MA, 2019.
  • [5] Nicolas Bondoux, Anh Quan Nguyen, Thomas Fiig, and Rodrigo Acuna-Agost. Reinforcement learning applied to airline revenue management. Journal of Revenue and Pricing Management, 19(5):332–348, 2020.
  • [6] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  • [7] Giovanni Giallombardo, Francesca Guerriero, and Giovanna Miglionico. Profit maximization via capacity control for distribution logistics problems. arXiv preprint arXiv:2008.03216, 2020.
  • [8] Abhuit Gosavi, Naveen Bandla, and Tapas K. Das. A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Transactions, 34(9):729–742, 2002.
  • [9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [10] Robert Klein, Sebastian Koch, Claudius Steinhardt, and Arne K. Strauss. A review of revenue management: Recent generalizations and advances in industry applications. European Journal of Operational Research, 284(2):397–412, 2020.
  • [11] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In Lecture Notes in Computer Science, pages 282–293. Springer Berlin Heidelberg, 2006.
  • [12] Ryan J. Lawhead and Abhijit Gosavi. A bounded actor–critic reinforcement learning algorithm applied to airline revenue management.

    Engineering Applications of Artificial Intelligence

    , 82:252 – 262, 2019.
  • [13] Yuri Levin, Mikhail Nediak, and Huseyin Topaloglu. Cargo capacity management with allotments and spot market demand. Operations Research, 60(2):351–365, 2012.
  • [14] Tatsiana Levina, Yuri Levin, Jeff McGill, and Mikhail Nediak. Network cargo capacity management. Operations Research, 59(4):1008–1023, 2011.
  • [15] Silvano Martello. Knapsack problems: algorithms and computer implementations. Wiley-Interscience series in discrete mathematics and optimization, 1990.
  • [16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [17] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc., 2019.
  • [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [19] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
  • [20] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis.

    Mastering the game of go with deep neural networks and tree search.

    Nature, 529(7587):484–489, 2016.
  • [21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018.
  • [22] Kalyan T. Talluri and Garrett J. Van Ryzin. The Theory and Practice of Revenue Management. Springer US, 2004.

Appendix A Details on the Experimental Study and Results

In the Appendix, we report detailed information on the experimental study and the results discussed in Section 4.

a.1 VRP Formulation

In this section, we provide a MIP formulation for the VRP, which is similar to that in [7], with the exception that the problem is constrained by .

s.t. (7)

The decision variables are , , and

: The binary variable

equals if vehicle uses arc . The binary variable equals if vehicle visits node . The quantity of accepted requests from vehicle at location is given by the non-negative variable . Objective function (6) minimizes the routing cost. Constraints (7) and (8) ensure that the each vehicle goes in and out of the visited nodes. Constraints (9) assert that each vehicle can visit a node at most once, and (10) limits the number of vehicles to at most . Constraints (11) ensure that every tour is connected, and (12) restrict vehicle from collecting more than the capacity at location . Finally, Constraints (13) guarantee that the capacity is not exceeded, and (14) ensure that the accepted requests at each location are fulfilled.

a.2 Instances

In this section, we detail the parameters that define the 4 sets of instances. In each of the instances we determine the capacity to be proportional to the inverse demand of the locations. Specifically, we use a load factor , and determine the capacity by


4 Locations:

The locations given by , the number of periods , the revenue for accepting a request from each location are defined by , , , . The probability of no request is , for . The initial request probabilities for each location are , , , . The remainder of the request probabilities are then given by for and for . The coordinates for each location are sampled uniformly at random in the interval . The number of vehicles available at no cost is and the cost for each additional over is . The capacity is determine using a load factor .

10 Locations:

The locations given by , the number of periods , the revenue for accepting a request from each location are defined by for , for , and for . The probability of no request is , for . The initial request probabilities for each location are for , for , and for . The remainder of the request probabilities are then given by for , for , and for . The coordinates for each location are sampled uniformly at random in the interval . The number of vehicles available at no cost is and the cost for each additional over is . The capacity is determine using a load factor .

15 Locations:

The locations given by , the number of periods , the revenue for accepting a request from each location are defined by for , for , and for . The probability of no request is , for . The initial request probabilities for each location are for , for , and for . The remainder of the request probabilities are then given by for , for , and for . The coordinates for each location are sampled uniformly at random in the interval . The number of vehicles available at no cost is and the cost for each additional over is . The capacity is determine using a load factor .

50 Locations:

The locations given by , the number of periods , the revenue for accepting a request from each location are defined by for , for , and for . The probability of no request is , for . The initial request probabilities for each location are for , for , and for . The remainder of the request probabilities are then given by for , for , and for . The coordinates for each location are sampled uniformly at random in the interval . The number of vehicles available at no cost is and the cost for each additional over is . The capacity is determine using a load factor .

a.3 Mean Profit and Computing Times

Table 2 reports the time required to generate data and train supervised learning model. These times required for all algorithms but BLP, BLPR, and rand-.

Locations Data Generation Time Training Time
4 285.77 0.19
10 1577.34 0.76
15 2698.57 0.90
50 2379.68 1.88
Table 2: Data generation and supervised learning times. All times in seconds.

Tables 3 reports the mean profit obtained by each approach as well as the offline and online computing times as described in Section 4.

Locations Method Mean Profit Online Time Offline Time
4 DP-Exact 113.88 0.28 2874.83
DP-ML 113.13 0.30 29.23
SARSA 109.33 0.27 558.18
MCTS-rand-30 109.58 5.47 -
MCTS-rand-100 112.23 22.01 -
MCTS-SARSA–30 111.40 6.35 558.18
MCTS-SARSA-100 111.03 24.46 558.18
BLP 78.35 0.02 0.01
BLPR 82.81 0.03 0.01
rand-0.6 71.49 0.32 -
10 SARSA 189.25 0.91 1040.38
MCTS-rand-30 199.45 9.06 -
MCTS-rand-100 201.95 36.04 -
MCTS-SARSA–30 194.54 11.81 1040.38
MCTS-SARSA-100 197.11 43.77 1040.38
BLP 118.94 10.93 2.84
BLPR 130.01 12.63 2.84
rand-0.7 158.64 0.95 -
15 SARSA 400.43 1.72 1640.48
MCTS-rand-30 422.19 18.30 -
MCTS-rand-100 425.14 74.87 -
MCTS-SARSA–30 423.38 25.32 1640.48
MCTS-SARSA-100 421.59 97.51 1640.48
rand-0.7 357.95 1.55 -
50 SARSA 1098.78 1.31 7045.92
MCTS-rand-30 1095.02 74.92 -
MCTS-rand-100 1118.60 296.10 -
MCTS-SARSA–30 1112.45 123.47 7045.92
MCTS-SARSA-100 1153.71 450.05 7045.92
rand-0.7 862.52 1.04 -
Table 3: Mean profits and computing times. All times in seconds.

a.4 Model Parameters

For supervised learning, we use the implementation of random forests from scikit-learn [18]. We note that our results were obtained using the default parameters provided in scikit-learn and changing model parameters did not have a significant impact on results.

For SARSA, we implement exploration by taking a random action with probability . We use Pytorch [17] to implement the neural value function approximation. For all sets of instances, we use Adam optimizer [9] with MSE loss. We use a neural network with one hidden layer and learning rate that vary depending on the problem setting. The parameters for each instance are reported in Table 4. In each setting, we train SARSA for 25,000 iterations and evaluate the mean profit over 50 validation instances every 100 episodes. We then use the SARSA model that obtained the highest mean profit in the remainder of our experiments.

Locations Hidden layer dimension Learning rate
4 128
10 256
15 256
50 1024
Table 4:

Neural network hyperparameters

For MCTS, the UCT hyperparameter for each instance and algorithm is provided in Table 5. To determine these values, we did a set of experiments to see what achieved the best quality solutions on a small number of validation instances.

Locations MCTS-rand-30 MCTS-rand-100 MCTS-SARSA-30 MCTS-SARSA-100
4 1.0 1.0 1.0 10.0
10 1.0 1.0 0.001 0.001
15 100.0 100.0 100.0 1.0
50 10.0 100.0 10.0 100.0
Table 5: MCTS UCT hyperparameter