1 Introduction
Our research project is situated at the nexus of Operations research (OR) and Machine learning (ML). OR methods can be applied to a wide range of hard, discrete optimization decision problems and are crucial in numerous contexts throughout the world, e.g., for transportation planning, production management or control of energy systems. Fast solution algorithms are generally available for the deterministic problems whose characteristics are known exactly. However, most actual problems are naturally stochastic and much harder and timeconsuming to solve, whence largescale applications are typically restricted to deterministic formulations despite a lack of realism.
Our objective is to demonstrate the application and performance of a stateoftheart neural machine translation (NMT) algorithm for generating fast and accurate predictions of solutions to a stochastic discrete optimization decision problem. We build upon the methodology delineated in Larsen et al. (2018) and assume (i) that we can compute solutions to a decision problem under full information using an existing deterministic optimization model and a solver and (ii) that information about the problem is revealed progressively, full information being available only at the time when the decision problem is ultimately solved. We wish to predict certain characteristics of this fully informed solution, based on currently available partial information, calling such a characterization a tactical solution description. Faced with this stochastic optimal prediction problem, we predict the tactical solution descriptions using supervised ML, where the training data consists of a large number of fully informed problems that have been solved independently and off line. Crucially, we attend to applications where the predictions must be delivered with high accuracy and speed, actually much faster than solving a single decision problem under full information.
Reaching beyond the roots shared with Larsen et al. (2018), we focus on tasks where the tactical solution and the available information are more detailed. Whereas Larsen et al. (2018)
concentrates on the class of feedforward neural networks mapping between vectors of fixed length, we extend the scope to the class of neural machine translation (NMT) approximators. The latter generically map between ordered sequences of variable length and, through specialization, also map between any two types of data representations among the following: fixedlength vector, bag of fixed or variable size, (ordered) sequence of fixed or variable length. For comparative purposes, we also present a faster, albeit less precise baseline model built upon a data transformation and a standard feedforward neural architecture.
Our closest neighbor in the ORML literature is Vinyals et al. (2015) that applies supervised training to an NMT predictor in order to approximate solutions to a particular class of deterministic integer linear programming (ILP) problems. Nair et al. (2018)
also addresses the prediction of ILP solutions under imperfect information. However, its methodology, based on reinforcement learning, requires simulations and cannot deliver predictions at a speed that is sufficient for the applications that we wish to address. For a survey of the nexus of OR and ML, see
Bengio et al. (2018).Our motivating application concerns booking decisions of intermodal containers on doublestack trains: The assignment of containers to slots on railcars is a combinatorial optimization problem – called the load planning problem (LPP) – that cannot be solved deterministically at the time of booking due to imperfect information: the LPP depends on characteristics of both railcars (e.g., weight capacity, geometric loading restrictions), and containers (e.g., weight and size), and container weights are not available at the time of booking. Whereas
Larsen et al. (2018) predicted how many containers of each type in a given set can be loaded on a given set of railcars of different types and how many of the railcars of each type are needed, here we predict with greater detail the number of containers of each type to be loaded on each railcar. In view of the intended realtime application in a high volume booking system, the solution must be computed in very short time (fraction of a second). This problem is of practical importance in the particular railway management regime of blocking with reclassification and features characteristics that make it useful for illustrating the proposed methodology: Although an ILP formulation of the problem can be solved under full information by commercial ILP solvers in seconds to minutes (Mantovani et al., 2018), this formulation cannot be used directly for the application since the container weights are not known at the time of booking. Furthermore, for the purposes of booking decisions, the operational solution (assignment of each container to positions on railcars) is unnecessarily detailed.Contributions
The paper offers these contributions to the ORML literature:

Demonstrate the use of a stateoftheart NMT model for predicting solution descriptions of generic combinatorial decision problems under imperfect information in very short computation time.

Illustrate how to represent the combinatorial problem statement and solution description with the input and output vocabularies and syntaxes of an NMT model.

Indicate how to enforce constraints restricting either the inputoutput map or the output itself throughout training, validation and testing of the NMT model, as well as at prediction time, with a probability mask.

Define a measure of the discrepancy between predicted and actual solution descriptions that is invariant to order.

Present a baseline model built upon a data transformation and a standard feedforward neural architecture.

Illustrate the proposed methodology and perform a detailed analysis of its predictive and computational performance through an application to the containerrailcar LPP.
The remainder of the paper is structured as follows: Section 2 presents the related literature. It defines the NMT approximators and delineates their role in approximating the prediction function relating the statement of an optimal prediction problem and its solution. Section 3 adapts the NMT approximator to the containerrailcar LPP problem. Chiefly, this involves specifying input and output languages and syntaxes, imposing constraints relating inputs and outputs and defining an appropriate measure of the predictive accuracy. Section 4 examines the computational properties of the NMT approximators in their application to the LPP problem. First, Sections 4.1 to 4.4 lay the required groundwork by describing successively the ML apparatus used in training and validating the NMT approximators, the containerrailcar LPP data, the baseline NMT approximator and a lower bound for the predictive error based on an approximate stochastic programming solution. Finally, Section 4.5 discusses extensively the predictive performances and computation times achieved by alternative versions and implementations of the NMT and baseline approximators.
2 Related Literature
We define formally the optimal prediction problem and explain the role of NMT approximators in approximating the prediction function relating the statements of the optimal prediction problem and their solutions. We also present the NMT model lying at the core of the main NMT approximator under examination.
2.1 The Optimal Prediction Problem
We briefly state the optimal prediction problem of Larsen et al. (2018) and highlight the scope of our work. Let a particular instance of a fully informed (deterministic) optimization problem be represented by the input feature vector x. The optimal fully detailed (i.e., that containing values of all decision variables), fully informed solution is , where and denote respectively the cost function and the admissible space. We define a partition where contains available features and unavailable ones at the time of prediction. Furthermore, we denote by the mapping from the fully detailed and informed solution to the tactical solution description featuring the level of detail relevant to the context at hand. Hence, is the synthesis of the fully detailed and informed solution according to the tactical solution description embedded in . Our goal is to approximate the solution to the following twostage, optimal prediction stochastic programming (see, e.g., Kall and Wallace, 1994, Birge and Louvaux, 2011, Shapiro et al., 2009) problem:
(1) 
(2) 
where denotes a suitable norm (e.g. the  or norm when the output has fixed size) and
denotes either the expectation or a quantile (e.g., the median) operation over the distribution of
, conditional upon . Hence, is the optimal prediction of the synthesis of the secondstage optimizer , conditionally on information available at first stage. Finally, is the admissible space defined by the set of constraints relevant to the fully informed context, whereas is defined only by the set of constraints relevant to the partially informed context.We aim to generate a prediction function that can take any value of as input, and outputs accurate predictions of . The difficulty of this task largely depends on the complexity of . In comparison with Larsen et al. (2018), we consider the challenging case of a more detailed tactical solution description. Here, the predictions are given by where is a particular NMTbased approximator and is a vector of tuned parameters.
We select and through supervised ML based on inputoutput data made up of pairs. Practically, treating the stochasticity in hinges on the particular definition of the data pairs that are used for learning. Similar to Larsen et al. (2018), we use an implicit method by simply passing the dataset to ML. Larsen et al. (2018) already generated this data through a controlled probabilistic sampling of ILP problem statements (input) and by computing their solutions (output) with a standard ILP solver.
2.2 NMT Approximators
Although its original purpose was to perform NMT between natural languages, the NMT apparatus can generically map between finite input and output sequences of either fixed or variable length, either ordered or unordered, whose elements take values in finite sets. As far as we know, Sutskever et al. (2014) was the first to characterize the NMT apparatus as a generic approximator, introducing the expression “sequence to sequence learning”. Hence, we find the term “NMT approximator” appropriate.
2.2.1 NMT Models as Building Blocks of NMT Approximators
The chain rule of probabilities is the foundation of the NMT approximator. It expands the joint probability of the complete output sequence, conditionally upon the input sequence, as the product of the marginal probabilities of the single output elements, conditionally upon the previous output elements and the complete input sequence. NMT seeks a solution to
where(3) 
is the probability of output sequence in the target language conditionally upon input sequence in the source language and denotes element of the output sequence. An NMT model is simply a ML representation for . For a survey of NMT models, see Young et al. (2017).
(5) 
Standard supervised training, validation and testing can be used to select a particular class of NMT models, the associated hyperparameters and the parameter values . Recall that in our case the data consists of pairs of input and output sequences .
To generate the prediction of an output sequence given an input sequence, a search procedure successively selects individual elements of the output so as to maximize the joint probability of occurrence of the predicted output sequence conditionally on the input sequence. Calculations are based on (4). The required marginal conditional probabilities are calculated with the trained NMT model . Beam search (breadthfirst search with limited memory) is widely used in the search procedures in view of its excellent performance and modest computational requirements.
Finally, we note that probability masks attached at each step in the search process to can be used to enforce constraints restricting either the map between and or only . General constraints can be enforced in this manner. We devote Section 3.2 to a discussion on how we impose constrains in our application.
2.2.2 A Specific NMT Model
The particular NMT model that we implement in our application is essentially the recurrent neural network (RNN) model of
Bahdanau et al. (2014), in which we introduce a number of additions and adaptations pertaining to the choice of the input and output vocabularies and syntaxes, as well as to the imposition of inputoutput constraints restricting the output syntax. Its architecture comprises the following elements:
Encoder: A neural network mapping the sequence of inputs into a sequence of latent “annotations” of same length. Annotations are meant to summarize the content of the input sequence and are made up of the concatenation of forwardstepping and backwardstepping states. Forward and backward encoding states are each generated with an RNN, equipped with gated recurrent units (GRU) whose role is to control how much the previous encoding state should be updated or propagated given the current input element.

Attention mechanism: A neural network responsible for producing a custom weighted average (a.k.a. “context”) of all encoding annotations at each step in the generation of the output sequence. Its weights (a.k.a. “alignments”) depend on the previous decoding state (defined below) and all encoding annotations. The attention mechanism is meant to flexibly exploit all annotations originating from the encoder for generating the current output element. Whereas the original model comprises a single attention mechanism, we also experimented with an extended architecture comprising multiple specialized encoders and attention mechanisms.

Decoder: A neural network mapping the sequence of current contexts produced by the attention mechanism and the sequence of previous output elements into a sequence of decoding states. The decoding states are generated with an RNN equipped with GRUs controlling how much the previous decoding state should be updated or propagated given the current context and the previous output element.

Output layer: A neural network mapping the current context and the past decoding state into the probability distribution of the current output element conditional upon input sequence and past output elements.
The introduction of an attention mechanism – the seminal contribution of Bahdanau et al. (2014) – was responsible for bringing NMT models to the forefront of automatic translation. This innovation resulted in the first endtoend NMT model to attain stateoftheart performance in automatic translation (previously achieved with sequential statistical translation models) from English to French and English to German. It has spawned an abundant literature whose contributions range from straightforward alterations to major reinterpretations of the attention mechanism and inclusion into radically different neural architectures that exclude RNNs. See, e.g., the literature on socalled transformers initiated by Vaswani et al. (2017).
Overall, the improvements in performance achieved so far in the field of natural language translation by the successors to Bahdanau et al. (2014)
have been modest. However, replacing the RNNs can lead to substantial reductions in computational times by allowing parallelization. RNNs lack in this regard. The sizes of the sets where the input and output sequences take their values (i.e. the vocabularies) are considerably smaller in our application than they are in natural language processing (less than 200 vs several thousands). Hence, the lack of parallelization resulting from the use of RNNs presents a smaller disadvantage in our application. Transformer NMT model may be considered for realtime calculations requiring even more speed than that can be afforded by the model in
Bahdanau et al. (2014) or those among its followers whose architectures are based on RNNs.3 Adapting the NMT approximator to the Load Planning Problem
We adapt the NMT approximator based on Bahdanau et al. (2014) to the containerrailcar LPP. Whereas Larsen et al. (2018) predicts how many railcars of each type are used and how many containers of each type are loaded, we aim to predict loadings of individual railcars. A loading is a tuple specifying the type of a railcar that is being used and the numbers of containers of each type that are assigned to it. Predictions featuring this level of detail cannot be accomplished with the simpler predictive models considered in Larsen et al. (2018).
We provide an illustrative example in Figure 1. Inputs of the NMT approximator are identical to those of Larsen et al. (2018). Namely, they state the numbers of available railcars of each type and the numbers of assignable containers of each type. This example features three types of railcars and two types of containers. The particular instance shown includes one available railcar of each type and four assignable containers of each type. Our application covers the ten most common railcar types in North America, accounting together for close to 90% of the rail car fleet, and two types of containers. The righthand side of Figure 1 illustrates the output. At the top, the output of Larsen et al. (2018) states the numbers of assigned containers of each type and the numbers of used railcars of each type. At the bottom, the output of the NMT approximator is of variable length and indicates for each used railcar how many of each type of container are to be loaded on it (without specifying the actual assignment to places on the railcar which would correspond to the operational solution).
The output of the NMT approximator is a sequence of variable length with elements taking values in a finite set (since there are finitely many railcar types and there are for each type finitely many ways to load it with containers). For our purposes, the output sequence does not need to be ordered. However, ordering of input and output sequences is integral to the architecture of an NMT model. Hence, we simply disregard the order of the loadings in the predictions that are generated with the model and tally the count of each type of loading in the generated output sequence.
The application of the NMTbased conditional predictor to the containerrailcar LPP requires that LPP statement and solution description be expressed in the input and output languages. Essentially, we need to select appropriate vocabularies and syntaxes for the input and the output. Ceteris paribus, small vocabularies and simple, regular syntaxes are desirable in order to exploit the available training data efficiently. Any outstanding constraint restricting the map between LPP statement and solution or the LPP solution itself must also be upheld by the output syntax of the NMT. We discuss the details in the following two sections.
3.1 Vocabularies and Syntaxes
The NMT input phrase describing an LPP statement is made up of 26 tokens. The first 20 tokens are divided into 10 pairs. Each pair is associated with a particular type of railcar indexed from type 0 to type 9. Using a decimal representation, a pair specifies with two digits the number of available railcars of a particular type. Hence, the number of available railcars can range from 0 to 99 based on this syntax. In a similar fashion, the last 6 words in the input phrase are divided into 2 triples. Each triple is associated with a particular length of container, that is either 40 ft or 53 ft. Using a decimal representation, a triple specifies with three digits the number of assignable containers of this length. Hence, the number of assignable containers can range from 0 to 999. A distinct set of tokens representing digits from 0 to 9 is assigned to each railcar type and to each container length. Hence, the input vocabulary is made up of tokens.
Notice how a syntax based on the digits of a numerical representation implicitly modulates the effective length of the input sequence in relation with the number of represented objects: only higher input counts will have variable, non zero digits in the higher powers of their representation. This contributes to maintain sufficient statistical leverage throughout the input and output range (see Vinyals et al., 2016, p. 3).
The output phrase describing the LPP solution description is based on a vocabulary of size 157. Notwithstanding empty loads, the examination of the admissible loading patterns for the 10 railcar types considered in our application indicates that there are in total 155 ways in which some 40 ft and 53 ft containers can be loaded onto the 10 types of railcars. Hence, there are 155 tokens associated with these 155 distinct feasible loadings. There are two additional tokens in the output vocabulary. The token “EOS” terminates every output phrase. The token "BLANK" signifies that no container was loaded onto any railcar. The latter cannot be preceded by any other token and must be followed by EOS. Hence, an output sequence must match either one of these cases:

Variable number of output tokens belonging to the 155 tokens describing the feasible loadings, terminated by the token EOS.

BLANK followed by EOS.
3.2 Constraints Restricting the Output Syntax
The marginal conditional probability distribution
produced by the NMT model must assign zero probabilities to the following occurrences: The current token
implies the assignment of a number of containers that exceeds the number of remaining assignable containers;

implies the use of an unavailable railcar,

BLANK follows another token,

BLANK is not followed by EOS,

EOS appears in first position,

EOS is not last in the output sequence.
These requirements may be enforced by updating and applying a probability mask in the output layer of the NMT model. The mask projects the probability of any excluded event to zero and remaining nonzero probabilities are normalized so as to sum to one. The application of constraints with a probability mask is also an alternative to the approach exemplified by Vinyals et al. (2015) where the relevant constraints are embodied in the model (pointer network) architecture.
3.3 Measuring the Predictive Accuracy
The following statistic measures the mean unordered discrepancy over the observations of a data set in the containerrailcar LPP application between the actual output sequences and the corresponding predicted output sequences generated with a trained approximator.
(6) 
where

the index of a railcar type runs from to ( in our application),

the index of the padded, actual and predicted loadings in observation for railcar type runs from to , ,

is the number of actual loadings in observation related to railcar type before padding,

is the number of predicted loadings in observation i related to railcar type before padding,

the index of a container length runs from to ( in our application),

acts as a permutation of the loading indexes for a given observation and a given railcar type ,

is the actual number of loaded containers of length for loading among the loadings related to railcar type in observation before padding,

is the padded, actual number of loaded containers of length for loading amond the loadings related to railcar type in observation ; and are padding, empty loadings,

is the predicted number of loaded containers of length for loading among the loadings related to railcar type in observation before padding,

is the padded, predicted number of loaded containers of length for loading amond the loadings related to railcar type in observation ; and are padding, empty loadings.
Practically, the inner expression can be translated into the following program that we solve with a general purpose integer programming solver, :
(7) 
subject to:
(8) 
(9) 
(10) 
is invariant to permutations in the output sequence. This is as it should be since we are not concerned with the order of an output sequence for our purposes, but rather with the counts of the types of loadings appearing in the solution. could not be used in the training process since it is not amenable to analytical gradientbased optimization. Although it could in principle have been used in the validation process, it was not, since its computation is too highly demanding for these purposes.
4 Computational Results
This section examines the computational properties of the NMT approximators in their application to the LPP problem. Sections 4.1 to 4.4 establish the required foundations. They successively describe the ML apparatus used in training and validating the NMT models, the containerrailcar LPP data, the baseline NMT approximator and a lower bound for the predictive error based on solutions originating from approximate stochastic programming. Section 4.5 discusses extensively the predictive performances and computation times achieved by alternative versions and implementations of the NMT and baseline approximators.
4.1 The ML apparatus
We trained the NMT and baseline models with pseudolikelihood maximization and stochastic minibatch gradient descent. Except when considering extension to multiple encoders and attention mechanisms, the sizes and numbers of layers in the NMT model were kept as in Bahdanau et al. (2014). We experimented with the adadelta and adam methods (see, e.g., Sebastian Ruder, 2019)
of automatic learning rate (i.e., step size) adaptation. Regularization was exerted through dropout and early stopping. The latter was applied with a patience of at least one epoch. Minibatch size was equal to 64 throughout. The validation process applied in the selection of hyperparameter settings and for the purpose of early stopping was based on pseudolikelihood. As explained in Section
2.2, predictions were generated with beam search.Training, validation and prediction generation operations were performed on a single highpower GPU (Nvidia V100 or Titan XP). GPU RAM requirements for training, validation and prediction generation were and in most cases well below 4 GB. We used the Python 2.7 programming language with the Theano 1.02 symbolic computation library
(Theano Development Team, 2016) and the Groundhog metalibrary (GroundHog development team, 2015). The code made publicly available by the authors of Groundhog and Bahdanau et al. (2014) was adapted and extended (see GroundHog development team, 2015). The Cython C language compiler (Behnel et al., 2011) ensured the fast calculation of the probability masks. Measurement of the final predictive performance was performed with Java and CPLEX on an Intel i7 processor.4.2 The ContainerRailcar LPP Data
Our data is similar to that of Larsen et al. (2018). It is partitioned into four classes, as reported in Table 1. This partitioning facilitates experiments where models are trained and validated on easier instances (A) and tested on either similar, harder (B, C) or hardest ones (D). There are two data sets of class A: A’ contains 10 M instances and has been randomly divided into training (64%), validation (16%) and test (20%) sets. A” is independent from A’ and contains 100 K instances. There are additional independent 100 K data sets for each one of the other classes. Those are labeled respectively B”, C” and D” and are used in conducting extraneous testing.
Class name  Description  # of containers  # of platforms 

A  Easiest ILP instances  [1, 150]  [1, 50] 
B  More containers than A (excess demand)  [151, 300]  [1, 50] 
C  More platforms than A (excess supply)  [1, 150]  [51, 100] 
D  Largest and hardest instances  [151, 300]  [51, 100] 
4.3 Baseline Model
Stateoftheart NMT models are complex and require considerable investments in knowledge and computational resources for their training, selection and implementation. We are interested in comparing the predictive performance and computational requirements associated with an NMT model (Bahdanau et al., 2014) with those of a simpler, standard neural network. Our idea is as follows: since the size of the output vocabulary is quite small (157), it is possible to transform the data so that it in effect constitutes a sample of the marginal probability of the next loading, conditionally upon the input sequence and all previously committed loadings. Such a transformation would be intractable with the output vocabulary sizes typically encountered in natural language processing. The purpose of the baseline model is to build a baseline approximator along the lines explained in Section 2.2.1. The latter will be evaluated in terms of its own merit and will also act as a reference in appraising the usefulness of the NMT approximator.
Transforming the original data set
Consider a particular example in the original data set. The latter consists, on the one hand, of an input sequence specifying the numbers of available railcars of each type and the numbers of assignable containers of each length and, on the other hand, of an output sequence of variable length, say , made up of individual railcar loadings. This particular original example can be expanded into new examples in the transformed data set. Each one of the new examples features an input sequence of a fixed length equal to . The elements in this sequence stand for the 10 numbers of available railcars of each type, followed by the 2 numbers of assignable containers of each length, followed by 157 numbers specifying the numbers of committed loadings of each type, defined according to the original output vocabulary. All new examples share common first 12 elements in their input sequence. These 12 elements reflect the input sequence of the original example. Output of a new example consists of one loading type index among the 157. The indexes appearing in the outputs of the new examples correspond to the loadings appearing in the original output sequence of length . Hence, the last 157 elements of the new examples from to of the transformed data set can be generated successively as follows.

New example with index has zeros in all positions from 12 to 178 of its input sequence and its output is equal to the index of the first loading in the original output sequence.

Input sequence of new example with index , results from input sequence of new example by adding 1 to the number appearing in position , where is the the output of new example . Output of example is equal to index of the ith loading in the original output sequence according to the original output vocabulary.
Feedforward Architecture
Based on the transformed data set, a standard classification model can be trained so as to approximate the marginal distribution of the next loading, conditional upon the original input sequence and the loadings previously committed. For this purpose, we selected a feedforward neural network with 179 units in the input layer and 157 units in the softmax output layer. All units were equipped with rectified linear activations (ReLU). A small number of alternative configurations in a range of numbers of hidden layers and units per hidden layer were considered. The constraints described in Section
3.2 above were imposed during training, validation and generation. A probability mask similar to that of the NMT model was applied for this purpose.4.4 A Lower Bound for the Predictive Error
The stochastic limits of the statistic
that are estimated in Section
4.5 for the stated combinations of model, training algorithm, training set and validation or testing set are all bounded from below and away from zero due to the stochastic nature of the prediction problem. Estimates of the relevant lower bounds can be calculated from approximate solutions of the corresponding optimal prediction stochastic programming problems. For this purpose, we use the sample average approximation (SAA) method. Such lower bounds make it possible to assess how far the NMT approximators are from performing optimally and if an attempt at improving their performance by increasing the capacity of the NMT model and/or the size of training data is worthwhile. We stress that approximate stochastic programming solutions such as SAA must be calculated for every particular value of , whereas the NMT and baseline approximators define prediction functions that are valid for every value of .The SAA statistics reported in Table 2 are based on twostage samples whose first stage is given by the 100 K examples of A” and whose second stage comprise samples of container weights (i.e., scenarios) with sizes equal to 5, 10, 25, 50, 99. Since the estimated mean value of monotonically decreases from 0.108 to 0.079 at a decreasing rate as a function of the sample size (i.e., number of scenarios) and is identical for sample sizes 50 and 99, it is reasonable to view 0.079 as being in the vicinity of the lower bound for the optimal prediction problem. We shall refer to this value as the SAA bound over the set A”. Table 2
also reports estimates of the means and standard deviations of the times required to compute the SAA solution for each second stage sample size. We note that, as expected, the computing times and the associated standard errors increase with the number of scenarios while the standard error of
decreases.Calculation of an individual SAA solution to the LPP proceeds through these steps: for a given set of first stage variables (inputs whose values are available in advance in the application), draw a sample of sets of second stage variables (inputs whose values are unavailable in advance). Using an ILP solver, compute LPP solutions for each combination of the set of first stage values and one of the individual sets of second stage values. Minimize the sample average of over all LPP solutions thus computed. The resulting minimizer is the sought SAA solution.
Nb. of scenarios  Computing time [s]  

Est. mean  Est. std dev  Est. mean  Est. std dev  
5  0.089  0.166  4.649  15.841 
(5.26E03)  (5.01E02)  
10  0.084  0.157  9.250  18.187 
(4.96E04)  (5.75E02)  
25  0.081  0.152  23.182  29.817 
(4.8E04)  (9.43E02)  
50  0.079  0.150  46.539  54.130 
(4.73E04)  (1.71E01)  
99  0.079  0.148  93.277  109.972 
(4.69E04)  (3.48E01)  
* number of sets of weights (scenarios) drawn for each example  
Standard error of estimate is reported between parentheses. 
4.5 Predictive Performance and Computing Times
Table 3 reports estimates of the mean and standard deviation of the predictive error achieved over a number of sets by the model trained and validated over the 6.4 M training examples and 1.6 M validating examples of data set A’. The first row reports the predictive performance over the 1.6 M validating examples of data set A’. Data set A” in second row is independently drawn from the same distribution as A’ and the predictive performance over this set can be compared to the SAA bound equal to 0.079. Distributions of data sets B”, C” and D” differ from that used for generating training and validation data and the predictive performance over these sets indicates the extent to which a model trained and validated over easiest examples found in data set A” can generalize to the harder and hardest examples in data sets B”, C” and D”. Table 4 reports estimates of the mean and standard deviation of the corresponding prediction times.
The results show that the prediction error incurred by the NMT approximator over data independent from but similar to the data used for training and validation is in the vicinity of the the SAA bound and therefore quite good. While there is still room for improvement, it must be noted that the NMT approximator reaches this performance with very little hyperparameter tuning beyond the final values reported in Bahdanau et al. (2014). Furthermore, the estimated standard deviation of is small and, importantly, estimated mean prediction times are in the order of milliseconds, hence within our limited budget, and estimated standard deviations of prediction times are small. The estimate of the mean error incurred by the SAA predictor over the set A” is smaller than that of the MNT approximator, even when using as few as 5 scenarios. However, the estimated mean computing time is several orders of magnitudes longer and with larger standard deviation. In comparison to the NMT model, the baseline model exhibits a surprisingly good performance and estimated mean computing times of the baseline approximator are even shorter than those of the NMT approximator.
Data  NMT 1attention  Baseline  

# examples  Est. mean  Est. std dev  Est. mean  Est. std dev 
A’ 1.6 M  0.101  0.169  0.162  0.218 
(1.33E04)  (1.72E04)  
A” 100 K  0.115  0.179  0.172  0.222 
(5.65E04)  (7.00E04)  
B” 100 K  0.054  0.127  0.063  0.144 
(4.03E04)  (4.56E04)  
C” 100 K  0.257  0.197  0.270  0.168 
(6.24E04)  (5.31E04)  
D” 100 K  0.181  0.178  0.096  0.156 
(5.63E04)  (4.95E04)  
Standard error of estimate is reported between parentheses. 
Data  NMT 1attention  Baseline  

# examples  Est. mean  Est. std dev  Est. mean  Est. std dev 
A’ 1.6 M  0.025  1.55E02  0.010  6.07E03 
(1.23E05)  (4.80E06)  
A” 100 K  0.025  1.52E02  0.009  5.34E03 
(4.82E05)  (1.69E05)  
B” 100 K  0.031  1.71E02  0.010  6.25E03 
(5.42E05)  (1.98E05)  
C” 100 K  0.047  2.91E02  0.016  8.88E03 
(9.20E05)  (2.81E05)  
D” 100 K  0.091  3.02E02  0.028  6.63E03 
(9.55E05)  (2.10E05)  
Standard error of estimate is reported between parentheses. 
A number of findings emerged from the experiments leading to the selection of the best performing NMT approximator. For instance, we found in general that predictive performance was unaffected by increasing the beam width from 5 to 10 but that computing time was one order of magnitude larger on average.
While inputoutput constraints were always enforced at prediction time, we could choose to enforce them or not during training and validation. We achieved similar final predictive performance in both cases. However, it frequently caused numerical instability when truncating the probabilities at prediction time. As a result, we applied inputoutput constraints throughout training, validation and generation.
Table 5 reports the time required for training the NMT and baseline models over the 6.4 M examples of A’ whether the constraints are imposed or not and whether the adadelta or adam step size adaptation methods are applied. Training with adam when constraints were enforced was slower and required on average more than twice as much time as adadelta to reach the threshold of early stopping than when constraints were not enforced. Overall, the first four rows of the table illustrate that the benefits of the step size adaptation methods are contextspecific and that they must viewed as any other hyperparameter.
Model  Step  Constraints  time [s]  # epochs  Total 
(training)  / minibatch  time [h]  
NMT 1attention  adadelta  0.5  10.3  143  
NMT 1attention  adadelta  0.36  15.32  153  
NMT 1attention  adam  0.832  16.32  377  
NMT 1attention  adam  0.68  6  113  
NMT 3attention  adadelta  0.90  8  200  
NMT 3attention  adadelta  0.70  11  214  
Baseline  adadelta  0.016  6  2.5  
* of transformed data, approximately 0.6 epochs of original data 
The predictive performances reported in Table 3 are based on models trained on 6.4 M examples of class A’. In order to assess the impact of sample size on predictive performance, we trained the NMT and baseline models on 640 K and 64 K examples of class A’. On the one hand, a tenfold reduction in the size of the training set had no detrimental effect on predictive performance (measured over A’, estimate of mean value of equals 0.101 with an estimated standard error of 1.29E04). On the other hand, total training time until early stopping was considerably smaller, decreasing from 143 h to 50.5 h. However, a further reduction in the size of the training set to 64 K examples was consequential as the estimate of mean value of measured over A’ increased to 0.146 (with estimated standard error of estimate equal to 1.68E04).
We found that the single attention NMT model instance resulting from training and validation over respectively 6.4 M and 1.6 M examples of A’ achieves on the training set a pseudolikelihood value nearly identical to the pseudolikelihood value achieved over the validation set (0.000602 vs 0.000604, respectively). This points strongly to the conclusion that improvements in predictive performance, if any are possible, should be sought by expanding model capacity rather than the size of the training set. The latter is corroborated by the close value of 0.000644 measured over the same validation set when the training data set was tentatively doubled in size to 12.8 M.
As an attempt to usefully increase capacity, we implemented a multiple attention NMT model featuring three encoderattention pairs, instead of the single one found in Bahdanau et al. (2014). One pair focuses on the input elements related to the available railcars, another focuses on the input elements related to the assignable containers and a third one attends to the whole input sequence as in Bahdanau et al. (2014)
. This failed to improve upon the predictive performance achieved by the original singleattention model (on class A, estimate of mean value of
equals 0.101 for the multiple attention model with estimated standard error of estimate equal to 1.30E04). Moreover, as verified in Table 5, training times for the threeattention model are considerably longer than the singleattention model.As a quality control, we were interested in measuring the performance achieved by the NMT approximator when predicting simpler solution descriptions as in Larsen et al. (2018). The performance criterion in this case is the sum of the mean absolute prediction error over the number of used slots and number of loaded containers. The lower bound associated with this statistic could be computed with stochastic programming (SAA). It is equal to 0.82 (with an estimated standard error equal to 0.0087) whereas, depending on the size of the training set, the values reported for the statistic in Larsen et al. (2018) for a classification model are respectively 0.965 (standard error of 0.002) and 1.481 (standard error of 0.018). We find it reassuring that the NMT approximator achieves for this statistic a value of 1.421 (standard error of 0.003) over a similar test set with very little hyperparameter tuning.
5 Conclusion
This paper has addressed the problem of computing a close approximate solution to any particular instance of a stochastic discrete optimization problem in a fraction of a second. The difficulty of this problem depends on the desired level of detail of the solution. Building upon Larsen et al. (2018), we focused on predicting more detailed solution descriptions requiring machine learning models that can handle outputs of variable size. We aimed to verify whether a stateoftheart neural machine translation (NMT) algorithm could generate fast and accurate predictions of solutions with minimal adaptation to the model architecture and minimal hyperparameter tuning. We adapted the existing NMT models by introducing problem specific input and output vocabularies and syntaxes and inputoutput constraints. While these are specific to our application, we believe that similar vocabularies, syntaxes and inputoutput constraints can be tailored to a broad range of programming problems, for instance to bin packing problems. The proposed NMT and baseline approximators demonstrated good predictive performances that are in the neighborhood of the stochastic lower bound computed with sample average approximation. Their comparative advantage with respect to sample average approximation is that of very fast computation with very low standard deviation.
Acknowledgements
We are grateful to Yoshua Bengio who was involved in an initial discussion about this problem. We have also benefited from discussions with Andrea Lodi. This research was funded by the Canadian National Railway Company (CN) Chair in Optimization of Railway Operations at Université de Montréal and a Collaborative Research and Development Grant from the Natural Sciences and Engineering Research Council of Canada (CRD47793814). The research is also partially funded by the “IVADO Fundamental Research Project Grants” under project entitled “Machine Learning for (Discrete) Optimization”. Computations were made on the supercomputers Briarée and Guillimin, managed by Calcul Québec and Compute Canada. The operation of these supercomputers is funded by the Canada Foundation for Innovation (CFI), the Ministère de l’Économie, de la Science et de l’Innovation du Québec (MESI) and the Fonds de recherche du Québec  Nature et technologies (FRQNT). We also benefited from the computing resources provided by Mila.
References
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.
 Behnel et al. (2011) Behnel, S., Bradshaw, R., Citro, C., Dalcin, L., Seljebotn, D. S., and Smith, K. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39, 2011.
 Bengio et al. (2018) Bengio, Y., Lodi, A., and Prouvost, A. Machine learning for combinatorial optimization: a methodological tour d’horizon. CoRR, abs/1811.06128, 2018. URL http://arxiv.org/abs/1811.06128.
 Birge and Louvaux (2011) Birge, J. R. and Louvaux, F. Introduction to Stochastic Programming. Springer Series in Operations Research and Financial Engineering. Springer New York, New York, NY, 2011.
 GroundHog development team (2015) GroundHog development team. Groundhog by lisagroundhog, 2015. URL https://github.com/lisagroundhog/GroundHog.
 Kall and Wallace (1994) Kall, P. and Wallace, S. W. Stochastic Programming. John Wiley & Sons, 1994.
 Larsen et al. (2018) Larsen, E., Lachapelle, S., Bengio, Y., Frejinger, E., LacosteJulien, S., and Lodi, A. Predicting solution summaries to integer linear programs under imperfect information with machine learning. CoRR, abs/1807.11876, 2018. URL http://arxiv.org/abs/1807.11876.
 Mantovani et al. (2018) Mantovani, S., Morganti, G., Umang, N., Crainic, T. G., Frejinger, E., and Larsen, E. The load planning problem for doublestack intermodal trains. European Journal of Operational Research, 267(1):107–119, 2018.

Nair et al. (2018)
Nair, V., Dvijotham, D., Dunning, I., and Vinyals, O.
Learning fast optimizers for contextual stochastic integer programs.
In
Proceedings of the ThirtyFourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 610, 2018
, pages 591–600, 2018. URL http://auai.org/uai2018/proceedings/papers/217.pdf.  Sebastian Ruder (2019) Sebastian Ruder. An overview of gradient descent optimization algorithms, 2019. available at: http://ruder.io/optimizinggradientdescent/index.html (Accessed on June 10, 2019).
 Shapiro et al. (2009) Shapiro, A., Dentcheva, D., and Ruszczynski, A. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. Society for Industrial and Applied Mathematics, 2009.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014. URL http://arxiv.org/abs/1409.3215.
 Theano Development Team (2016) Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv eprints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
 Vinyals et al. (2015) Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In NIPS, pages 2692–2700, 2015. URL https://arxiv.org/pdf/1506.03134.pdf.
 Vinyals et al. (2016) Vinyals, O., Bengio, S., and Kudlur, M. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR), 2016. URL http://arxiv.org/abs/1511.06391.
 Young et al. (2017) Young, T., Hazarika, D., Poria, S., and Cambria, E. Recent trends in deep learning based natural language processing. CoRR, abs/1708.02709, 2017. URL http://arxiv.org/abs/1708.02709.
Comments
There are no comments yet.