1 Introduction
Revenue management (RM) is of importance in many commercial applications such as airline cargo, hotels, and attended home delivery (see, e.g., the survey [10]). In general, RM focuses on decisions regarding the distribution of products or services with the goal of maximizing the total profit or revenue. The focus of this work is on quantity decisions in the context of a booking control problem, where a set of requests is observed across a time horizon. Each request can either be accepted or rejected. At the end of the booking period, the set of accepted requests needs to be fulfilled; this corresponds to an operational decision-making problem.
The booking control problem can be formulated as a Markov Decision Process (MDP), which describes the relationship between the revenue from accepting a request, the decrease in capacity for future requests, and the cost associated with the fulfillment of the accepted requests. Although the MDP captures the problem structure, solving it is often intractable due to the curse of dimensionality. As such, approximate methods have been widely adopted for a range of RM problems. Bid-price and booking-limit control policies as described in [22] are among the most used methods for capacity control problems.
Outside of the context of RM, reinforcement learning (RL) has seen success in a variety of challenging control problems, such as Atari games [16] and Go [20]. For a detailed overview of RL we refer the reader to [4, 21]. Despite the success of RL in applications with intractable state spaces, its applications within RM have mainly been limited to airline seat allocation problems [5, 8, 12].
A major limitation for the direct application of RL to capacity control problems is the time required for simulating the system. Indeed, the computational cost associated with solving the end-of-horizon operational decision-making problem is non-negligible, which makes a simulation-based approach prohibitively expensive. This limitation can be observed in applications such as vector packing in airline cargo management [3, 13, 14] or vehicle routing (VRP) in distribution logistics [7]. To overcome this, we propose an approximation of the operational cost via supervised learning. We then leverage the resulting prediction function within approximate dynamic programming (DP) and RL algorithms and evaluate our approach on an application in distribution logistics.
In this paper, we make the following contributions: (1) We propose a methodology that (i) formulates an approximate MDP by replacing the formulation of the operational decision-making problem with a prediction function defined by offline supervised learning and (ii) uses approximate DP and RL techniques to obtain approximate control policies. (2) We apply the proposed methodology to a distribution logistics problem from the literature [7] and show computationally that our policies provide increased profit at a reduced evaluation time.
The remainder of the paper is organized as follows. In Section 2, we introduce our methodology. In Section 3, we introduce an application to distribution logistics as well as the considerations made in order to formulate the supervised learning task. In Section 4, we provide results of various control policies and compare to baselines. Finally, Section 5 concludes.
2 Methodology
In this section, we first introduce a general MDP formulation of our booking control problem, closely following the notation in the literature, e.g., [22]. Second, we describe a formulation based on an approximation of the operational costs.
2.1 Booking Control Problem Formulation
Let $\mathcal{I}$ denote a set of requests with cardinality $n$. The decision-making problem is defined over $T$ periods indexed by $t = 1, \dots, T$. The probability that request $i \in \mathcal{I}$ is made at time $t$ is given by $p_t^i$ and the probability that no request occurs at period $t$ is given by $p_t^0$. The time intervals are defined to be small enough such that at most one request is received in each period. The revenue associated with accepting request $i$ is $r_i$.
To formulate this as an MDP, we let $x_t^i$ denote the number of requests of type $i$ accepted before time $t$ and let $x_t = (x_t^1, \dots, x_t^n)$ denote the state vector, where the $i$th index is given by $x_t^i$. The decision variables are denoted by $u_t \in \{0, 1\}$, where $u_t = 1$ if an offer for request $i$ is accepted at time $t$. The deterministic transition is given by $x_{t+1} = x_t + u_t e_i$, with $e_i$ the vector with $i$th component equal to 1 and every other component equal to 0.
An operational decision-making problem occurs at the end of the booking period and is related to the fulfillment of the accepted requests. This problem only depends on the state at the end of the time horizon, $x_{T+1}$. We denote the operational cost by $C(x_{T+1})$.
With the above definitions, we now define the dynamic program that represents the maximal expected profit from period $t$ onward. The value function is denoted by $V_t(x_t)$ and is given by
$V_t(x_t) = \sum_{i=1}^{n} p_t^i \max_{u \in \{0,1\}} \left\{ r_i u + V_{t+1}(x_t + u e_i) \right\} + p_t^0 V_{t+1}(x_t)$, (1)
$V_{T+1}(x_{T+1}) = -C(x_{T+1})$, (2)
with $V_1(\mathbf{0})$ giving the maximal expected profit over the entire booking horizon.
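For illustration, the backward recursion (1)–(2) can be solved exactly on toy instances. The sketch below assumes a small number of request types and caps accepted counts to keep the state space finite; the instance data at the bottom is hypothetical.

```python
import itertools

def solve_booking_dp(T, revenues, probs, no_request_prob, cost_fn, max_accept):
    """Backward induction for a toy booking-control MDP.

    probs[t][i]: probability that request type i arrives in period t.
    cost_fn(x): operational cost of the final state x (tuple of accepted counts).
    max_accept[i]: cap on accepted requests of type i (keeps the state space finite).
    """
    n = len(revenues)
    states = list(itertools.product(*[range(m + 1) for m in max_accept]))
    # Terminal condition: V_{T+1}(x) = -C(x).
    V = {x: -cost_fn(x) for x in states}
    for t in reversed(range(T)):
        V_next, V = V, {}
        for x in states:
            v = no_request_prob[t] * V_next[x]
            for i in range(n):
                stay = V_next[x]
                if x[i] < max_accept[i]:
                    y = x[:i] + (x[i] + 1,) + x[i + 1:]
                    accept = revenues[i] + V_next[y]
                else:
                    accept = float("-inf")  # cannot accept beyond the cap
                v += probs[t][i] * max(accept, stay)
            V[x] = v
    return V  # V[(0,)*n] is the maximal expected profit

# Hypothetical toy instance: 2 request types, 3 periods, linear operational cost.
value = solve_booking_dp(
    T=3,
    revenues=[10.0, 6.0],
    probs=[[0.4, 0.4]] * 3,
    no_request_prob=[0.2] * 3,
    cost_fn=lambda x: 2.0 * sum(x),
    max_accept=[3, 3],
)
```

With a linear cost, accepting is optimal whenever the revenue exceeds the marginal cost, so the toy value at the empty state is simply the sum of expected per-period gains.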
2.2 Approximation
Solving the MDP (1)–(2) can be intractable even for small instances. For this reason, approximate DP or RL appears to be a natural choice. However, these algorithms typically rely on some form of policy evaluation that consists in simulating trajectories of the system under a given policy. In our context, such policy evaluation is computationally costly due to (2). To overcome this, we propose the use of an approximation of $C$. We first introduce a mapping from the state at the end of the time horizon to a $q$-dimensional representation with the function $\phi$. Then, we define an approximation of $C(x_{T+1})$ as $\tilde{C}(\phi(x_{T+1}))$, and an approximate MDP formulation is then given by
$\tilde{V}_t(x_t) = \sum_{i=1}^{n} p_t^i \max_{u \in \{0,1\}} \left\{ r_i u + \tilde{V}_{t+1}(x_t + u e_i) \right\} + p_t^0 \tilde{V}_{t+1}(x_t)$, (3)
$\tilde{V}_{T+1}(x_{T+1}) = -\tilde{C}(\phi(x_{T+1}))$. (4)
2.3 Supervised Learning
The approximation in (4) can be defined in various ways: by a problem-specific heuristic, by a mixed-integer program (MIP) solved to a time limit or optimality gap, or by a machine learning (ML) prediction. Here, we focus on ML in combination with a heuristic. Specifically, we propose to train a supervised learning model offline so as to separate the problem of accurately predicting the operational cost from solving (3)–(4).
To train a supervised learning model, we require a feature mapping from the state to the input of the model and a set of labeled states at the end of the time horizon. The feature mapping is dependent on the application, but in general it plays the role of the mapping function in (4). In Section 3 we describe the specific set of features we use for the application in distribution logistics.
To obtain labeled data, we simulate trajectories of the system using a stationary random policy which accepts a request with a given probability $\rho$. At the end of the time horizon we obtain a state $x_{T+1}$ and compute $C(x_{T+1})$. We then repeat this process for different values of $\rho$. The idea is to have a representation of feasible final states in the data (optimal and suboptimal ones). We denote the set of labeled data by $\mathcal{D}$.
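The data-generation loop can be sketched as follows; the request generator, acceptance probability, and surrogate cost function below are placeholders for illustration (in our experiments the labels come from solved VRPs).

```python
import random

def generate_labeled_data(simulate_requests, accept_prob, operational_cost,
                          n_trajectories, seed=0):
    """Simulate trajectories under a stationary random policy and label final states.

    simulate_requests(rng): yields the sequence of arriving request types
    for one booking horizon.
    operational_cost(state): solves (or approximates) the end-of-horizon problem.
    Returns a list of (final_state, cost) pairs.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_trajectories):
        state = {}
        for req in simulate_requests(rng):
            if rng.random() < accept_prob:  # accept with probability rho
                state[req] = state.get(req, 0) + 1
        final = tuple(sorted(state.items()))
        data.append((final, operational_cost(final)))
    return data

# Hypothetical usage: 5 periods, two locations, linear surrogate cost.
def toy_requests(rng):
    return [rng.choice(["loc_a", "loc_b"]) for _ in range(5)]

dataset = generate_labeled_data(
    toy_requests, accept_prob=0.7,
    operational_cost=lambda s: float(sum(count for _, count in s)),
    n_trajectories=100,
)
```

Repeating the loop for several values of `accept_prob` yields final states of varying quality, as described above.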
3 Application in Distribution Logistics
We consider the distribution logistics booking control problem described in [7]. In this context, booking requests correspond to pickup activities and the operational decision-making problem to a VRP. Each pickup request has an associated location and revenue. The cost incurred at the end of the booking period is the cost of the VRP solution and hence depends on all the accepted requests.
The problem can be formulated as (1)–(2). We now detail how we solve the VRP to obtain solutions that are comparable to those in [7]. We assume that there is a fixed number of vehicles $K$, each with capacity $Q$. We also assume that the depot is at location $0$. The set of all nodes is given by the union of the request locations and the depot. The set of arcs connecting the nodes is denoted by $A$, and the cost of an arc $(i, j) \in A$ is given by $c_{ij}$. The optimal objective value of the VRP is denoted by $z$. If more than $K$ vehicles are required, then we allow additional outsourcing vehicles to be used at an additional fixed cost $c_f$ each.
We choose $c_f$ large enough such that the cost of adding an additional vehicle is larger than any potential revenue from the requests it can fulfill. Finally, the operational cost is given by
$C(x_{T+1}) = z(x_{T+1}) + c_f \max\{0, v(x_{T+1}) - K\}$, (5)
where $v(x_{T+1})$ denotes the minimal number of vehicles required to serve the accepted requests.
The VRP formulation is provided in Appendix A.1.
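To illustrate the structure of the operational cost in (5): the sketch below combines a routing cost (supplied by the caller, e.g. from a heuristic solver) with a fixed fee per outsourced vehicle. Where the paper determines the minimal number of vehicles with a bin-packing solver, this sketch uses a simple ceiling lower bound instead.

```python
import math

def operational_cost(demands, capacity, base_vehicles, outsource_cost, routing_cost):
    """Total end-of-horizon cost: routing cost plus a fixed fee per outsourced vehicle.

    demands: accepted quantity per location. routing_cost: cost of the VRP
    solution, supplied by the caller (e.g. a heuristic solver). The number of
    vehicles needed is lower-bounded here by total demand over vehicle capacity;
    a bin-packing solver would give the exact minimal count.
    """
    vehicles_needed = math.ceil(sum(demands) / capacity)
    extra = max(0, vehicles_needed - base_vehicles)
    return routing_cost + outsource_cost * extra
```

For example, with total demand 12, capacity 5, and 2 free vehicles, one outsourced vehicle is charged on top of the routing cost.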
Sets of Instances.
We generate 4 sets of instances with 4, 10, 15, and 50 locations. The locations were determined uniformly at random. The locations are split into groups, where the revenue in each group differs. The request probabilities are defined such that locations with a higher revenue have a greater probability of occurring later in the booking period. For a detailed description of the parameters that describe each set of instances, see Appendix A.2.
Features.
The features are computed from the state at the end of the time horizon, $x_{T+1}$. For the sake of simplicity, we consider a fixed-size input structure for the ML model, even though the number of locations can vary depending on the set of accepted requests. We therefore derive features from the capacity, the depot location, the total number of accepted requests per location, and aggregate statistics of the locations. For the latter, we use the distances between the locations and the depot, and the relative distances between locations. For each of these, we compute the min, max, mean, median, standard deviation, and 1st/3rd quartiles.
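A simplified sketch of such a fixed-size feature mapping (the exact feature set is application specific; the function and argument names below are illustrative):

```python
import numpy as np

def aggregate_stats(values):
    """Fixed-size summary of a variable-length list: min, max, mean, median, std, Q1, Q3."""
    v = np.asarray(values, dtype=float)
    return np.array([v.min(), v.max(), v.mean(), np.median(v), v.std(),
                     np.percentile(v, 25), np.percentile(v, 75)])

def state_features(locations, counts, depot, capacity):
    """Fixed-size feature vector for a final booking state.

    locations: (m, 2) coordinates of accepted locations; counts: accepted
    requests per location; depot: (2,) depot coordinates.
    """
    locs = np.asarray(locations, dtype=float)
    depot_dists = np.linalg.norm(locs - np.asarray(depot, dtype=float), axis=1)
    # Pairwise distances between distinct accepted locations.
    diffs = locs[:, None, :] - locs[None, :, :]
    pair = np.linalg.norm(diffs, axis=-1)
    pair_dists = pair[np.triu_indices(len(locs), k=1)]
    parts = [np.array([capacity, float(sum(counts))]),
             np.asarray(depot, dtype=float),
             aggregate_stats(depot_dists),
             aggregate_stats(pair_dists) if len(pair_dists) else np.zeros(7)]
    return np.concatenate(parts)
```

However many requests were accepted, the output always has the same dimension, which is what the fixed-size input structure requires.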
Prediction Task.
We seek an accurate approximation of (5) that is fast to compute. For this purpose, we use ML to predict the routing cost (in this work we train random forest models [6]) and we compute the outsourcing cost with a bin-packing solver (MTP, [15]).
Data for Supervised Learning.
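A minimal sketch of the prediction step, using scikit-learn's random forest with default parameters on synthetic stand-in data (the feature and label generation below is hypothetical; in the paper the labels come from solved VRPs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled final states: features -> operational cost.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 8))
y = X.sum(axis=1) + rng.normal(0, 0.5, size=1000)  # hypothetical cost structure

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0)  # default parameters
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
mse = mean_squared_error(y_te, pred)
mae = mean_absolute_error(y_te, pred)
```

At prediction time, a single call to `model.predict` replaces a full VRP solve, which is what makes online policy evaluation cheap.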
We generate one data set for each of the four sets of instances (4, 10, 15, and 50 locations) using the algorithm described in Section 2.3. To compute a label for each instance in a reasonable time, we use FILO [1], a heuristic solver for VRPs. To ensure the VRP solutions make use of a minimal number of vehicles, we offset the depot location. We note that the data generation is fast: it takes less than five minutes for the 4-location data set and about 40 minutes for the 50-location data set.
4 Results for the Distribution Logistics Application
In this section, we start by presenting supervised learning performance metrics, followed by the experimental setup and results on the booking control problem. Experiments were run on an Intel Core i7-10700 2.90GHz with 32GB RAM.
4.1 Supervised Learning Results
We partition each of the data sets into training/validation and test sets. It takes between 0.19 and 1.88 seconds to train the random forest models.
We assess the prediction performance using two metrics: mean squared error (MSE) and mean absolute error (MAE). The results reported in Table 1 show that we achieve relatively good performance, but it deteriorates with the size of the instances. In particular, we note that the MSE (a metric that penalizes large errors more severely than MAE) is quite large for the 50-location test data. Since labels are of larger magnitude when the number of locations increases, an increase in both MSE and MAE is expected; nonetheless, these results can potentially be improved by generating larger data sets and using more flexible ML models.
Locations  Training Size  Test Size  Training MSE  Test MSE  Training MAE  Test MAE

4  1,000  250  1.82  5.30  0.70  1.24 
10  2,000  500  3.12  13.80  1.26  2.79 
15  2,000  500  1.45  11.17  0.90  2.53 
50  2,000  500  23.22  139.41  3.68  9.24 
4.2 Baselines
We benchmark our results against three baseline policies. The first two are the booking limit policy (BLP) and the booking limit policy with reoptimization (BLPR), as described in [7] and implemented using SCIP [2] as the MIP solver, without row generation. As a third baseline, we use the stationary random policy (rand) with the acceptance probability $\rho$ giving the highest mean profit (we use the same values of $\rho$ as for the data generation).
It is possible to solve the smallest instances with exact DP, so in this case we report results for the exact algorithm combined with solving each operational problem with FILO (DP-Exact) and with the predicted costs (DP-ML). For the problems with more than 10 locations, our implementation in SCIP did not find incumbent solutions within a reasonable time, so the BLP and BLPR baselines are omitted for the 15- and 50-location instances.
4.3 Algorithms
We consider a standard set of RL and approximate DP algorithms to obtain approximate control policies: SARSA [19] as well as Monte-Carlo tree search (MCTS) with Upper Confidence bounds applied to Trees (UCT) [11]. These algorithms all rely on a set of simulated trajectories to evaluate the expected profit in (3). Specifically, we consider SARSA with a neural value function approximation. We define two variants of the MCTS algorithm distinguished by their base policy: one uses a random policy (MCTS-rand-X) and the other SARSA (MCTS-SARSA-X). Here, “X” denotes the number of simulations each algorithm uses per state, and in our experiments we use 30 and 100. Each of the above algorithms uses our approximation of (5) when computing a policy (i.e., the sum of the predicted operational cost and the outsourcing cost computed with MTP). However, we use FILO in the last step to compute the final operational cost.
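For context, at each tree node UCT selects the action maximizing the average observed value plus an exploration bonus controlled by a hyperparameter $c$ (the values we tune are reported in Appendix A.4). A minimal sketch, with a hypothetical statistics layout:

```python
import math

def uct_select(node_stats, c):
    """Pick the child maximizing the UCT score: mean value + exploration bonus.

    node_stats: {action: (visit_count, total_value)}. Unvisited actions are
    expanded first. c is the exploration hyperparameter.
    """
    total_visits = sum(n for n, _ in node_stats.values())
    best_action, best_score = None, float("-inf")
    for action, (n, w) in node_stats.items():
        if n == 0:
            return action  # always expand unvisited actions first
        score = w / n + c * math.sqrt(math.log(total_visits) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

With $c = 0$ the rule is purely greedy on the observed means; larger $c$ shifts the search toward rarely visited actions, which is the trade-off tuned per instance set.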
4.4 Control Problem Results
To compare performance, we evaluate each method over the same 50 realizations of requests for each of the sets of instances. We start by analyzing solution quality followed by computing times.
In Figure 1, we provide box plots showing the distribution of the percentage gaps to the best known solution (i.e., the highest-profit solution for each instance) for each algorithm. Figure 1(a) shows that, as expected, DP-Exact has the lowest gaps, with a median at zero. The same algorithm with approximate operational costs (DP-ML) results in gaps close to those of DP-Exact, demonstrating the effectiveness of predicting the operational cost. Comparing our approaches to the baselines in Figures 1(a) and 1(b), we can see that the policies obtained via any of the proposed control algorithms achieve smaller gaps than those of the BLP, BLPR, and rand baselines. Moreover, the BLP and BLPR policies report larger gaps than those of rand for 10 locations. For all sets of instances, the MCTS algorithms consistently achieve the smallest gaps, with some variation depending on the number of simulations and the base policy. As expected, a larger number of simulations leads to smaller variance.
We now turn our attention to the analysis of computing times (reported in detail in Appendix A.3). For this purpose, we distinguish between offline and online computing time and focus the analysis on the latter. For DP-Exact, DP-ML, SARSA, and MCTS-SARSA-X, we compute an exact or approximate value function offline. Similarly, the initial booking limit policy (BLP and BLPR) is computed offline, while any reoptimization of the booking limits and the solution of the VRP at the end of the time horizon contribute to the online computing time. For MCTS-rand-X and rand, the policies are computed entirely online. The time associated with generating data and training the ML models adds to the offline computing time of the corresponding algorithms.
As expected, the online computing times are the shortest for SARSA and rand, with an average of less than 2 seconds for all sets of instances. On the 10-location instances, the booking limit policies (BLP and BLPR) have an average online computing time comparable to MCTS-rand-30 and MCTS-SARSA-30 (10.93 and 12.63 seconds compared to 9.06 and 11.81 seconds, respectively). The computing time of the MCTS algorithms depends on the instance size. On average, it approximately doubles for 15 locations compared to 10 locations, and the 50-location instances are on average 8 times more time consuming to solve than the 10-location ones in the case of MCTS-rand-X (11 times in the case of MCTS-SARSA-X). This leads to average online computing times between 1.2 minutes (MCTS-rand-30) and 7.5 minutes (MCTS-SARSA-100) for the largest instances. We note that the latter is the algorithm achieving the highest-quality solutions. However, using 100 simulations instead of 30 increases the online computing time by approximately a factor of 4 for all sets of instances.
The tradeoff between solution quality and online computing time becomes clearly visible for the larger instances (15 and 50 locations). As highlighted in Figure 1, this tradeoff can be partly controlled by the simulation budget. Noteworthy is the performance of SARSA on the largest instances: On average, the gap to the best known solution is 4.7% and the online computing time is only 1.31 seconds. In Appendix A.3, Table 2 reports the data generation and training times, and Table 3 reports mean profit, offline, and online computing times.
Finally, we comment on the number of evaluations of the operational costs. For the sake of illustration, we compare DP-Exact and DP-ML. The former calls FILO 10,000 times to compute the value functions (offline), while DP-ML only solves 1,000 VRPs when generating data for supervised learning. The gain is even more important in the context of SARSA, which requires 25,000 cost evaluations during offline training. The MCTS-based algorithms require a number of cost evaluations (online) proportional to the number of simulations times the number of remaining periods.
5 Conclusion
In the context of complex RM problems where the quality of the booking policies hinges on an accurate evaluation of operational costs, we have proposed a methodology that formulates an approximate MDP by replacing the formulation of the operational decision-making problem with a prediction function defined by offline supervised learning. We used approximate DP and RL to obtain approximate control policies. We applied the methodology to the distribution logistics problem in [7] and showed computationally that our policies provide increased profit at a reduced evaluation time. This is because our algorithms strike a balance between online and offline computation: accurate predictions from the ML model combined with a bin-packing heuristic were used to evaluate approximate operational costs online in computing times that are negligible in comparison to solving the VRPs.
Acknowledgments
References
 [1] Luca Accorsi and Daniele Vigo. A fast and scalable heuristic for the solution of large-scale capacitated vehicle routing problems. Technical report, University of Bologna, 2020.
 [2] Tobias Achterberg. SCIP: Solving constraint integer programs. Mathematical Programming Computation, 1:1–41, 2009.
 [3] Christiane Barz and Daniel Gartner. Air cargo network revenue management. Transportation Science, 50(4):1206–1222, 2016.
 [4] Dimitri P Bertsekas. Reinforcement learning and optimal control. Athena Scientific Belmont, MA, 2019.
 [5] Nicolas Bondoux, Anh Quan Nguyen, Thomas Fiig, and Rodrigo Acuna-Agost. Reinforcement learning applied to airline revenue management. Journal of Revenue and Pricing Management, 19(5):332–348, 2020.
 [6] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
 [7] Giovanni Giallombardo, Francesca Guerriero, and Giovanna Miglionico. Profit maximization via capacity control for distribution logistics problems. arXiv preprint arXiv:2008.03216, 2020.
 [8] Abhijit Gosavi, Naveen Bandla, and Tapas K. Das. A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Transactions, 34(9):729–742, 2002.
 [9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [10] Robert Klein, Sebastian Koch, Claudius Steinhardt, and Arne K. Strauss. A review of revenue management: Recent generalizations and advances in industry applications. European Journal of Operational Research, 284(2):397–412, 2020.
 [11] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Lecture Notes in Computer Science, pages 282–293. Springer Berlin Heidelberg, 2006.

 [12] Ryan J. Lawhead and Abhijit Gosavi. A bounded actor–critic reinforcement learning algorithm applied to airline revenue management. Engineering Applications of Artificial Intelligence, 82:252–262, 2019.
 [13] Yuri Levin, Mikhail Nediak, and Huseyin Topaloglu. Cargo capacity management with allotments and spot market demand. Operations Research, 60(2):351–365, 2012.
 [14] Tatsiana Levina, Yuri Levin, Jeff McGill, and Mikhail Nediak. Network cargo capacity management. Operations Research, 59(4):1008–1023, 2011.
 [15] Silvano Martello. Knapsack problems: algorithms and computer implementations. Wiley-Interscience series in discrete mathematics and optimization, 1990.
 [16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [17] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc., 2019.
 [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [19] Gavin A Rummery and Mahesan Niranjan. Online Qlearning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.

 [20] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018.
 [22] Kalyan T. Talluri and Garrett J. Van Ryzin. The Theory and Practice of Revenue Management. Springer US, 2004.
Appendix A Details on the Experimental Study and Results
In the Appendix, we report detailed information on the experimental study and the results discussed in Section 4.
A.1 VRP Formulation
In this section, we provide a MIP formulation for the VRP, which is similar to that in [7], with the exception that the problem is constrained by the number of available vehicles.
(6)  
s.t.  (7)  
(8)  
(9)  
(10)  
(11)  
(12)  
(13)  
(14)  
(15)  
(16)  
(17) 
The decision variables are $x$, $y$, and $q$: the binary variable $x_{ijk}$ equals 1 if vehicle $k$ uses arc $(i,j)$; the binary variable $y_{ik}$ equals 1 if vehicle $k$ visits node $i$; and the quantity of accepted requests collected by vehicle $k$ at location $i$ is given by the nonnegative variable $q_{ik}$. Objective function (6) minimizes the routing cost. Constraints (7) and (8) ensure that each vehicle goes in and out of the visited nodes. Constraints (9) assert that each vehicle can visit a node at most once, and (10) limits the number of vehicles to at most $K$. Constraints (11) ensure that every tour is connected, and (12) restrict each vehicle from collecting more than the accepted quantity at a location. Finally, Constraints (13) guarantee that the vehicle capacity is not exceeded, and (14) ensure that the accepted requests at each location are fulfilled.
A.2 Instances
In this section, we detail the parameters that define the 4 sets of instances. In each of the instances, we determine the capacity to be proportional to the inverse demand of the locations. Specifically, we use a load factor and determine the capacity by
(18) 
4 Locations:
The locations are given by , the number of periods is , and the revenues for accepting a request from each location are defined by , , , . The probability of no request is , for . The initial request probabilities for each location are , , , . The remaining request probabilities are then given by for and for . The coordinates for each location are sampled uniformly at random in the interval . The number of vehicles available at no cost is and the cost for each additional vehicle is . The capacity is determined using a load factor .
10 Locations:
The locations are given by , the number of periods is , and the revenues for accepting a request from each location are defined by for , for , and for . The probability of no request is , for . The initial request probabilities for each location are for , for , and for . The remaining request probabilities are then given by for , for , and for . The coordinates for each location are sampled uniformly at random in the interval . The number of vehicles available at no cost is and the cost for each additional vehicle is . The capacity is determined using a load factor .
15 Locations:
The locations are given by , the number of periods is , and the revenues for accepting a request from each location are defined by for , for , and for . The probability of no request is , for . The initial request probabilities for each location are for , for , and for . The remaining request probabilities are then given by for , for , and for . The coordinates for each location are sampled uniformly at random in the interval . The number of vehicles available at no cost is and the cost for each additional vehicle is . The capacity is determined using a load factor .
50 Locations:
The locations are given by , the number of periods is , and the revenues for accepting a request from each location are defined by for , for , and for . The probability of no request is , for . The initial request probabilities for each location are for , for , and for . The remaining request probabilities are then given by for , for , and for . The coordinates for each location are sampled uniformly at random in the interval . The number of vehicles available at no cost is and the cost for each additional vehicle is . The capacity is determined using a load factor .
A.3 Mean Profit and Computing Times
Table 2 reports the time required to generate data and train the supervised learning models. These times are required for all algorithms except BLP, BLPR, and rand.
Locations  Data Generation Time (s)  Training Time (s)

4  285.77  0.19 
10  1577.34  0.76 
15  2698.57  0.90 
50  2379.68  1.88 
Table 3 reports the mean profit obtained by each approach as well as the offline and online computing times, as described in Section 4.
Locations  Method  Mean Profit  Online Time (s)  Offline Time (s)

4  DP-Exact  113.88  0.28  2874.83 
DP-ML  113.13  0.30  29.23  
SARSA  109.33  0.27  558.18  
MCTS-rand-30  109.58  5.47    
MCTS-rand-100  112.23  22.01    
MCTS-SARSA-30  111.40  6.35  558.18  
MCTS-SARSA-100  111.03  24.46  558.18  
BLP  78.35  0.02  0.01  
BLPR  82.81  0.03  0.01  
rand-0.6  71.49  0.32    
10  SARSA  189.25  0.91  1040.38 
MCTS-rand-30  199.45  9.06    
MCTS-rand-100  201.95  36.04    
MCTS-SARSA-30  194.54  11.81  1040.38  
MCTS-SARSA-100  197.11  43.77  1040.38  
BLP  118.94  10.93  2.84  
BLPR  130.01  12.63  2.84  
rand-0.7  158.64  0.95    
15  SARSA  400.43  1.72  1640.48 
MCTS-rand-30  422.19  18.30    
MCTS-rand-100  425.14  74.87    
MCTS-SARSA-30  423.38  25.32  1640.48  
MCTS-SARSA-100  421.59  97.51  1640.48  
rand-0.7  357.95  1.55    
50  SARSA  1098.78  1.31  7045.92 
MCTS-rand-30  1095.02  74.92    
MCTS-rand-100  1118.60  296.10    
MCTS-SARSA-30  1112.45  123.47  7045.92  
MCTS-SARSA-100  1153.71  450.05  7045.92  
rand-0.7  862.52  1.04   
A.4 Model Parameters
For supervised learning, we use the implementation of random forests from scikit-learn [18]. We note that our results were obtained using the default parameters provided in scikit-learn; changing model parameters did not have a significant impact on the results.
For SARSA, we implement exploration by taking a random action with a given probability. We use PyTorch [17] to implement the neural value function approximation. For all sets of instances, we use the Adam optimizer [9] with the MSE loss. We use a neural network with one hidden layer, with a hidden layer dimension and learning rate that vary depending on the problem setting. The parameters for each setting are reported in Table 4. In each setting, we train SARSA for 25,000 iterations and evaluate the mean profit over 50 validation instances every 100 episodes. We then use the SARSA model that obtained the highest mean profit in the remainder of our experiments.
Locations  Hidden layer dimension  Learning rate 

4  128  
10  256  
15  256  
50  1024 
Table 4: Neural network hyperparameters
For MCTS, the UCT hyperparameter for each set of instances and each algorithm is provided in Table 5. To determine these values, we ran a set of experiments to identify which values achieved the best quality solutions on a small number of validation instances.
Locations  MCTS-rand-30  MCTS-rand-100  MCTS-SARSA-30  MCTS-SARSA-100

4  1.0  1.0  1.0  10.0 
10  1.0  1.0  0.001  0.001 
15  100.0  100.0  100.0  1.0 
50  10.0  100.0  10.0  100.0 