# Approximate Dynamic Programming with Neural Networks in Linear Discrete Action Spaces

Real-world problems of operations research are typically high-dimensional and combinatorial. Linear programs are generally used to formulate and efficiently solve these large decision problems. However, in multi-period decision problems, we must often compute expected downstream values corresponding to current decisions. When applying stochastic methods to approximate these values, linear programs become restrictive for designing value function approximations (VFAs). In particular, the manual design of a polynomial VFA is challenging. This paper presents an integrated approach for complex optimization problems, focusing on applications in the domain of operations research. It develops a hybrid solution method that combines linear programming and neural networks as part of approximate dynamic programming. Our proposed solution method embeds neural network VFAs into linear decision problems, combining the nonlinear expressive power of neural networks with the efficiency of solving linear programs. As a proof of concept, we perform numerical experiments on a transportation problem. The neural network VFAs consistently outperform polynomial VFAs, with limited design and tuning effort.

## 1 Introduction

Problems in operations research (OR) are generally concerned with allocating resources, aiming to maximize some reward function. Applications of OR are found in domains such as transportation, energy, and manufacturing. Although many effective solutions – particularly linear programs (LPs) – exist for static problems, solving dynamic problems over a time horizon remains challenging, as we need downstream values corresponding to current decisions. Estimating these values is often difficult within linear programming settings.

We address the integration of neural networks and linear programs in the context of Approximate Dynamic Programming (ADP). Spurred by increasing availability of both data and computing power, neural networks are successfully applied in many fields. Their potential applications have also been identified for ADP, yet their use is not widespread. This paper extends an effort in this direction, explicitly considering implementation for problems with large action spaces.

A key challenge of ADP is to reliably estimate the downstream value corresponding to actions, enabling us to learn a policy that maximizes value over the full planning horizon. One research stream within ADP focuses on value function approximations (VFAs) to estimate downstream values. In this approach, we design a set of features (explanatory variables), organize them as a polynomial function that represents the value of being in a given state, and perform linear regression to learn the weights associated with the features. Although polynomial VFAs often yield satisfactory results, designing the features is challenging. Higher-level interactions in particular are difficult for human designers to grasp.

The problem of feature design is amplified by the large action spaces that are encountered in typical ADP settings. The action space often grows too large to enumerate within reasonable time. Formulating the decision problem as a mathematical program (preferably a linear program for efficient solving) preserves optimality and often vastly increases the size of problems that can be handled. Even if the action space is not overly large, there may be reasons to use mathematical programming, such as higher speed or the availability of existing model formulations. However, mathematical programming poses additional challenges for VFAs. Assuming a linear program, the variables representing the features in the objective function must be linear as well. Although nonlinear features might be designed, they must be expressed as linear systems, often requiring complicated constructions of artificial variables. Complex polynomial VFAs are therefore difficult to embed in LPs.

The integration of neural networks within LPs for decision making addresses this challenge. Neural networks are able to learn complex nonlinear functions; theoretically, a network with a single hidden layer can approximate any continuous function on a compact domain [Cybenko1989]. Because neural networks are not restricted by linearity, they may identify nonlinear structures between lower-level features, without explicitly defining these as features within the LP. To embed the network in an LP, the activation functions should be transformable into piecewise linear functions. Fortunately, most modern neural networks satisfy this condition.

In this paper, we design a hybrid approach to address ADP problems based on neural networks and LPs. The planning problem tested is a dynamic transport problem inspired by practice. To preserve focus on the methodological aspect of the paper, we will not discuss design choices in detail. The problem serves as a test case that is sufficiently rich and challenging to adequately test the solution method.

This paper contributes to the state of the art in the following ways. First, we design a hybrid approach for integrating neural networks and LPs to tackle ADP problems. Second, we provide insights into the performance of various neural network structures based on numerical experiments, specifically the quality and computational effort. Third, we show that neural network VFAs significantly improve upon current practice, which is based on polynomial VFAs.

## 2 Related work

Given the successful applications of neural networks in regression, their application to ADP problems seems natural. The idea is not novel; the seminal work of [Bertsekas and Tsitsiklis1995] already presents the use of feature vectors as input to neural networks as an established concept. [Powell2011] also describes neural networks as a powerful tool for ADP algorithms. However, neural network VFAs have not yet been well tested for large action spaces, where the neural network cannot be used to enumerate the downstream value for every action. We are not aware of previous studies addressing the integration of neural networks into decision-making LPs for this type of problem. We highlight some relevant works in ADP and reinforcement learning, discussing the most closely related applications of neural networks and linear programming.

We start with neural networks in ADP. [Bertsekas2008] discusses applying neural networks in ADP, coining the term neuro-dynamic programming. He broadly defines neural networks as essentially nonlinear VFAs, using either the full state or a smaller feature vector as input. Alternatively, neural networks may also be used as a pre-processing step to extract feature vectors from the state. According to [Powell2011], neural network VFAs have mainly been applied to classical engineering problems that typically have low-dimensional action spaces. [Schmidhuber2015] provides a survey of deep learning studies, including the use of neural networks in reinforcement learning. The neural networks are generally used to learn values associated with state-action pairs, i.e., as VFAs. No mention is made of embedding such VFAs in linear programs.

[Van Heeswijk and La Poutré2018] study shallow neural network VFAs in a transportation context, but require full enumeration of the action space.

We proceed to discuss linear programming in ADP. [De Farias and Van Roy2003] study the linear programming approach for ADP, assuming linearly defined VFAs. [Powell2016] states that decision problems with tens of thousands of dimensions can generally be solved with modern commercial solvers. However, when instances become too large, even linear programming may require unsatisfactory computational times. [Dulac-Arnold et al.2012] and [Pazis and Parr2011] propose factorization methods to divide the action space into linear subproblems, exponentially reducing the computational effort. The size of the state space is a limiting factor in their solution. In the transportation domain, [Pérez Rivera and Mes2017] and [Van Heeswijk et al.2019] provide recent examples of polynomial VFAs integrated in linear programs.

## 3 Solution method

We briefly introduce the notation for Markov decision problems (MDPs) as used in this paper. MDP models are useful to mathematically model decision problems with stochastic and dynamic properties. In OR, many such problems are combinatorial optimization problems. An MDP is a stochastic control process for which the objective is to maximize rewards (or minimize costs) over a discrete time horizon, with decision epochs separated by equidistant time intervals. A discounted MDP can be described by the tuple $(\mathcal{S}, \mathcal{X}, P, R, \rho)$, with $\mathcal{S}$ being the set of problem states, $\mathcal{X}(S)$ being the set of feasible actions when in state $S \in \mathcal{S}$, $P(S' \mid S, x)$ being the probability of transitioning from state $S$ to $S'$ after taking action $x$, $R(S, x)$ being the direct reward when taking action $x$ in state $S$, and $\rho$ being the discount rate applied to future rewards. The Bellman equation yields the maximum value corresponding to each state:

$$V(S) = \max_{x \in \mathcal{X}(S)} \left( R(S, x) + \rho \sum_{S' \in \mathcal{S}'} P(S' \mid S, x)\, V(S') \right).$$

Solving the Bellman equation for all states yields the optimal policy. Several techniques exist to accomplish this, yet for many realistic problems these are computationally intractable. The next section addresses this issue.
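To make the Bellman recursion concrete, the following minimal sketch applies value iteration to a toy MDP. The two states, two actions, rewards, and transition probabilities are hypothetical and chosen purely for illustration; they are not taken from the transportation problem studied here. Value iteration is one of the exact techniques that becomes intractable for realistic instances.

```python
# Hypothetical toy MDP (illustration only, not from the paper).
# OUTCOMES[(state, action)] lists (next_state, probability) pairs.
OUTCOMES = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "move"): [("B", 0.8), ("A", 0.2)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "move"): [("A", 1.0)],
}
# REWARD[(state, action)] is the direct reward R(S, x).
REWARD = {
    ("A", "stay"): 0.0,
    ("A", "move"): 1.0,
    ("B", "stay"): 2.0,
    ("B", "move"): 0.0,
}
ACTIONS = {"A": ["stay", "move"], "B": ["stay", "move"]}


def value_iteration(rho=0.9, tol=1e-10, max_sweeps=10_000):
    """Repeatedly apply the Bellman equation until the values stop changing."""
    values = {state: 0.0 for state in ACTIONS}
    for _ in range(max_sweeps):
        largest_change = 0.0
        for state in ACTIONS:
            # Direct reward plus discounted expected downstream value, maximized over actions.
            best = max(
                REWARD[state, action]
                + rho * sum(p * values[nxt] for nxt, p in OUTCOMES[state, action])
                for action in ACTIONS[state]
            )
            largest_change = max(largest_change, abs(best - values[state]))
            values[state] = best
        if largest_change < tol:
            break
    return values


if __name__ == "__main__":
    print(value_iteration())
```

Even in this toy case every sweep touches every state, action, and outcome, which is exactly what becomes prohibitive when the state, action, and outcome spaces are large.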

### 3.1 Approximate Dynamic Programming

Approximate Dynamic Programming (ADP) is a framework to learn policies for MDPs that are too large to solve exactly within reasonable time. This section provides a short and high-level overview. We refer to [Powell2011] for an extensive discussion on the topic. At its core, ADP uses Monte Carlo simulation to sample rewards and estimate the downstream values of state-action pairs, enabling us to learn good policies without exhaustively exploring the MDP.

From a computational perspective, problems may arise in three areas of MDPs, namely the sizes of the state space (number of states), action space (number of actions per state) and outcome space (number of possible outcomes per action). Multiple solution approaches exist for each of these areas; we restrict ourselves to the ones used in this paper.

We start with the outcome space $\mathcal{S}'$. To identify the best action in any state, the Bellman equation requires computing $P(S' \mid S, x)\, V(S')$ for each $S' \in \mathcal{S}'$, where $\mathcal{S}'$ might be unique for each state-action pair. ADP circumvents this procedure by instead attaching a single value to a state-action pair. Thus, we replace the stochastic expression $\sum_{S' \in \mathcal{S}'} P(S' \mid S, x)\, V(S')$ with a deterministic value function $V(S, x)$. For each state-action pair, we only need to evaluate one downstream value rather than $|\mathcal{S}'|$ outcomes. This downstream value is estimated by repeated Monte Carlo sampling, i.e., we randomly draw outcome states and observe their values.

Next, we discuss the state space $\mathcal{S}$. In many optimization problems the state is a high-dimensional vector with numerous possible realizations. Computing the value for each individual state may therefore be intractable. For this reason, we replace the true value function $V(S, x)$ with a value function approximation (VFA) $\bar{V}(S, x)$. The VFA is a function that returns an expected value given a set of features (explanatory variables) that capture the essential information in state-action pairs needed to estimate their value. The VFA design is further discussed in Section 3.2.

Finally, we address the action space $\mathcal{X}(S)$. In combinatorial problems, this space quickly grows beyond the limits of enumeration. As we need thousands of observations to learn a good policy, each decision problem should typically be solvable within a few seconds. To avoid enumerating the full action space, the decision problem may be expressed as a mathematical program. LPs in particular are well studied; modern solvers often solve such problems highly efficiently. Mathematical programs can be solved to optimality, while significantly increasing the size of the action spaces that can be handled.

The outline of the ADP algorithm is now presented. We use $N$ iterations to learn the VFA; each iteration represents a discrete time step. At every iteration $n$, the action $x^n$ maximizes expected value given the prevailing VFA $\bar{V}^{n-1}$, resulting in the following observed value:

$$\hat{v}^n = \max_{x^n \in \mathcal{X}(S^n)} \left( R(S^n, x^n) + \rho \bar{V}^{n-1}(S^n, x^n) \right).$$

The difference between the expected value for the preceding state-action pair at $n-1$ (i.e., $\bar{V}^{n-1}(S^{n-1}, x^{n-1})$) and the observation at $n$ (i.e., $\hat{v}^n$) updates the VFA, using an updating function $U(\cdot)$. Algorithm 1 shows the outline of the ADP algorithm to learn the VFA.

###### Algorithm 1

Basic ADP algorithm to learn the VFA.

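1. initialize $\bar{V}^0(\cdot)$
2. $n \leftarrow 1$
3. $S^1 \xleftarrow{P} \mathcal{S}$
4. while $n \leq N$ do
5. $\quad x^n \leftarrow \arg\max_{x^n \in \mathcal{X}(S^n)} \left( R(S^n, x^n) + \rho \bar{V}^{n-1}(S^n, x^n) \right)$
6. $\quad \bar{V}^n(\cdot) \leftarrow U\left(\bar{V}^{n-1}(\cdot), S^{n-1}, x^{n-1}, \hat{v}^n\right)$
7. $\quad \mathcal{S}' \leftarrow (S^n, x^n)$
8. $\quad S^{n+1} \xleftarrow{P} \mathcal{S}'$
9. $\quad n \leftarrow n + 1$
10. end while
11. return $\bar{V}^N(\cdot)$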
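The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy MDP is hypothetical, the VFA is a plain lookup table over state-action pairs rather than a feature-based approximation, and the updating function is assumed to be exponential smoothing with a constant step size `alpha`. The action is found by enumeration, which is only possible because the toy action space is tiny.

```python
import random

# Hypothetical toy MDP (illustration only, not from the paper).
RHO = 0.9
ACTIONS = {"A": ["stay", "move"], "B": ["stay", "move"]}
REWARD = {("A", "stay"): 0.0, ("A", "move"): 1.0,
          ("B", "stay"): 2.0, ("B", "move"): 0.0}
OUTCOMES = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "move"): [("B", 0.8), ("A", 0.2)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "move"): [("A", 1.0)],
}


def sample_next_state(rng, state, action):
    """Monte Carlo sample of the outcome state for a state-action pair."""
    states, probabilities = zip(*OUTCOMES[state, action])
    return rng.choices(states, weights=probabilities, k=1)[0]


def adp(num_iterations=5000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    vfa = {pair: 0.0 for pair in REWARD}       # initialize the VFA at zero
    state = rng.choice(sorted(ACTIONS))        # draw the initial state
    previous_pair = None
    for _ in range(num_iterations):
        # Evaluate every action with the prevailing VFA and keep the best one.
        candidates = {x: REWARD[state, x] + RHO * vfa[state, x]
                      for x in ACTIONS[state]}
        action = max(candidates, key=candidates.get)
        v_hat = candidates[action]             # observed value
        # Assumed updating function: smooth the estimate of the preceding
        # state-action pair towards the new observation.
        if previous_pair is not None:
            vfa[previous_pair] += alpha * (v_hat - vfa[previous_pair])
        previous_pair = (state, action)
        state = sample_next_state(rng, state, action)
    return vfa


if __name__ == "__main__":
    for pair, value in sorted(adp().items()):
        print(pair, round(value, 3))
```

In the method of this paper, the enumeration over `candidates` is the step that is replaced by solving a linear program.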

### 3.2 Polynomial VFA (PL-VFA)

This section addresses the VFA in more detail. As mentioned earlier, we operate on features that are extracted from state-action pairs. The features are described by a set of indicators, with each indicator referring to some representative feature of a state-action pair. We define a contractive mapping that extracts the features for any given state-action pair and returns the corresponding feature vector. Formally, the VFA is described as a function of this feature vector.

VFAs are commonly designed in polynomial form (PL-VFA). Let each feature have an associated weight. Then, the polynomial VFA is described by the weighted sum of the feature values. PL-VFAs are popular for several reasons. Polynomials are able to approximate most functions; an appropriate polynomial can, in theory, approximate the true value function arbitrarily closely. Furthermore, although the features may be nonlinear, the expression itself is linear. It can therefore be incorporated into linear programming formulations. Techniques such as temporal-difference learning may be used to update the weights [Sutton and Barto2018].
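As a minimal sketch of the idea, a PL-VFA is a weighted sum of feature values, and its weights can be nudged towards each new observation with a temporal-difference style step. The function names, the feature values, and the step size below are hypothetical; the update shown is the standard gradient step for a linear approximator and is not claimed to match the exact implementation used in the experiments.

```python
def pl_vfa(weights, features):
    """Polynomial VFA: weighted sum of the feature values."""
    return sum(w * f for w, f in zip(weights, features))


def td_update(weights, features, v_hat, alpha):
    """Move each weight in proportion to the prediction error and its feature value."""
    error = v_hat - pl_vfa(weights, features)
    return [w + alpha * error * f for w, f in zip(weights, features)]


if __name__ == "__main__":
    weights = [0.0, 0.0]
    features = [1.0, 2.0]          # hypothetical feature values of one state-action pair
    for _ in range(5):
        weights = td_update(weights, features, v_hat=5.0, alpha=0.1)
        print(weights, pl_vfa(weights, features))
```

Because the estimate is linear in the weights and the feature values, it can be placed directly in an LP objective as long as each feature is itself a linear expression of the decision variables.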

Although polynomials might theoretically approximate the true value function, randomly defining a polynomial will likely not perform well [Powell2016]. A properly designed PL-VFA is aligned with the structure of the value function. This manual design of VFAs is a key challenge for successful implementations, requiring careful modeling and testing of individual value functions. This is where the linear formulation becomes restrictive, as features representing higher-order effects must be explicitly modeled. Additional problems arise when we resort to linear programming to handle large action spaces. It then becomes challenging to express non-linear features in linear form. Such conversions often require complicated structures involving many artificial variables.

To overcome the limitations of polynomial VFAs, the VFA may be expressed by neural networks. The nonlinear architecture of such networks allows them to unravel complex structures, even when inputs are linear operands of state-action pairs. We further discuss neural network VFAs in the next section.

### 3.3 Neural network VFA (NN-VFA)

A general introduction to neural networks is provided by [Gurney2014]; we only address the VFA design. In neural network VFAs (NN-VFAs), the feature vector is transformed by a weighted set of nonlinear activation functions (neurons), resulting in a single output value. Compared to the PL-VFA, the main advantage is that the NN-VFA may learn higher-order effects that are not explicitly defined in the feature vector. We emphasize that the input quality remains crucial for the NN-VFA performance, but feature design is comparatively easier than for PL-VFAs.

The NN-VFA is composed of an input layer (the feature vector), at least one hidden layer containing neurons, and an output layer with a single node that returns the expected value for the given state-action pair [Van Heeswijk and La Poutré2018]. In a fully connected network, every neuron in the network connects to all neurons in the preceding layer. Each neuron receives the inner product of all neurons in the preceding layer and their corresponding output weights as input and transforms it into a single neuron value.

The NN-VFA contains one or more hidden layers, each identified by a layer index. The first index refers to the input layer that contains the features; the last index refers to the output layer. Furthermore, a second index refers to a specific neuron within a layer, drawn from the set of neurons in that layer. Each neuron represents a nonlinear activation function. For every layer after the input layer, an input weight describes the weight of a neuron in the preceding layer as input for a given neuron; together, these weights form the vector of inbound weights for that neuron.

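We next describe the neuron values. Each neuron in each layer has a value, and the values of all neurons in a layer together form the value vector of that layer. The value vector of the input layer equals the feature vector. The value of every other neuron is obtained by applying its activation function to the inner product of its inbound weight vector and the value vector of the preceding layer. Finally, the output value of the network is given by the value of the single node in the output layer.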
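The layer-by-layer computation can be sketched as a short forward pass. The sketch assumes ReLU activations in the hidden layers, a linear output node, and no separate bias terms (a bias can be supplied as a constant feature); the weights and feature values are hypothetical.

```python
def relu(z):
    """Rectified linear unit: returns 0 or the input, whichever is larger."""
    return max(0.0, z)


def forward(features, hidden_weights, output_weights):
    """Propagate a feature vector through fully connected hidden layers.

    hidden_weights: one entry per hidden layer; each entry holds one inbound
                    weight vector per neuron in that layer.
    output_weights: inbound weight vector of the single output node.
    """
    values = list(features)                      # input layer equals the features
    for layer in hidden_weights:
        values = [
            relu(sum(w * v for w, v in zip(inbound, values)))
            for inbound in layer
        ]
    # Assumed linear output node: weighted sum of the last hidden layer.
    return sum(w * v for w, v in zip(output_weights, values))


if __name__ == "__main__":
    features = [1.0, 2.0]                        # hypothetical feature vector
    hidden = [[[1.0, -1.0], [0.5, 0.5]]]         # one hidden layer, two neurons
    output = [2.0, 3.0]
    print(forward(features, hidden, output))
```

In the example the first hidden neuron receives a negative input and is switched off, which is the kind of nonlinearity a weighted sum of features cannot express.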

Activation functions in neural networks are nonlinear. Therefore, they cannot be directly computed within linear programs. However, most common activation functions in contemporary neural networks can be modeled by simple piecewise linear functions. Integration of the NN-VFA in linear programs is discussed in the next section.

### 3.4 Integrating the NN-VFA in LPs

Nowadays, many neural networks use (variants of) rectified linear units (ReLUs) as activation functions [Wilmanski et al.2016]. A ReLU returns either 0 or its input value, whichever is larger. ReLUs can be represented by a piecewise linear function with two components, allowing us to incorporate them in the LP designed to solve the decision problem. Each state-action pair has a unique expected downstream value. To evaluate actions, the neural network must therefore be expressed as a set of linear equations. We follow an implementation comparable to that of [Bunel et al.2018], using binary variables and big-M constraints to correctly compute the ReLU values. Additional artificial variables are required to compute the basis functions corresponding to actions. To preserve linearity of the action problem, the features should be linear expressions that can be derived from the state-action pair.
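A common big-M encoding of a single ReLU with input z and output y uses one binary variable d and the constraints y >= 0, y >= z, y <= z + M(1 - d), and y <= M d, where M bounds the absolute value of z. Whether the experiments use exactly this form is an assumption; the sketch below only checks, without any solver, that these four constraints leave max(0, z) as the sole feasible output. The function name is hypothetical.

```python
def relu_big_m_intervals(z, big_m):
    """Return the feasible (d, y_min, y_max) triples under the big-M encoding.

    Constraints, for binary d and a constant big_m with |z| <= big_m:
        y >= 0,  y >= z,  y <= z + big_m * (1 - d),  y <= big_m * d
    """
    feasible = []
    for d in (0, 1):
        y_min = max(0.0, z)                      # from y >= 0 and y >= z
        y_max = min(z + big_m * (1 - d),         # from y <= z + M(1 - d)
                    big_m * d)                   # from y <= M d
        if y_min <= y_max:
            feasible.append((d, y_min, y_max))
    return feasible


if __name__ == "__main__":
    for z in (-2.0, 0.0, 3.0):
        print(z, relu_big_m_intervals(z, big_m=10.0))
```

For a negative input only d = 0 is feasible and forces y = 0; for a positive input only d = 1 is feasible and forces y = z. A loose M keeps the encoding valid but typically weakens the relaxation the solver works with, so tight bounds on neuron inputs are worth deriving where possible.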

We use a stochastic gradient descent (SGD) algorithm to update the weights, meaning that the network weights are adjusted after each iteration and its corresponding observation [Haykin2009]. At the start of training, we use He initialization to generate starting values for the weights [He et al.2015]. The learning rate determines how responsive the weights are to observations deviating from the estimate.
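The two ingredients can be sketched for a network with one hidden ReLU layer. He initialization draws each weight from a zero-mean normal distribution with standard deviation sqrt(2 / fan_in); the SGD step below assumes a squared-error loss between the estimate and the observation, a linear output node, and no bias terms. All names and values are hypothetical.

```python
import math
import random


def he_init(fan_in, fan_out, rng):
    """He initialization: N(0, sqrt(2 / fan_in)) for each inbound weight."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def sgd_step(hidden, output, features, v_hat, alpha):
    """One SGD step on 0.5 * (estimate - v_hat)^2 for a one-hidden-layer ReLU net."""
    # Forward pass.
    pre = [sum(w * f for w, f in zip(row, features)) for row in hidden]
    act = [max(0.0, p) for p in pre]
    estimate = sum(w * a for w, a in zip(output, act))
    error = estimate - v_hat
    # Backward pass: the ReLU gradient is 1 for active neurons and 0 otherwise.
    new_output = [w - alpha * error * a for w, a in zip(output, act)]
    new_hidden = [
        [w - alpha * error * output[j] * (1.0 if pre[j] > 0 else 0.0) * f
         for w, f in zip(row, features)]
        for j, row in enumerate(hidden)
    ]
    return new_hidden, new_output, estimate


if __name__ == "__main__":
    rng = random.Random(0)
    features = [1.0, 0.5, -0.3]                  # hypothetical feature vector
    hidden = he_init(fan_in=3, fan_out=20, rng=rng)
    output = he_init(fan_in=20, fan_out=1, rng=rng)[0]
    for _ in range(5):
        hidden, output, estimate = sgd_step(hidden, output, features, 5.0, 0.01)
        print(round(estimate, 4))
```

A larger `alpha` makes the weights react more strongly to each observation, which speeds up learning but can also amplify errors as they pass through several layers.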

## 4 Experimental design

To validate the solution method as well as the performance of the NN-VFA, we run a number of numerical experiments that compare it to the PL-VFA. We evaluate both the behavior of the VFAs under varying circumstances and the performance of the resulting policies.

For a clear comparison that distills the essential insights, we keep the applications basic. To update the weights, we use TD(0) for the PL-VFA and SGD for the NN-VFAs, always using the same learning rate. In all cases, we use He initialization to set the initial weights. We use pure exploitation to acquire value observations, i.e., each decision maximizes the expected value given the prevailing policy. Furthermore, we deliberately do not put excessive effort into design and fine-tuning; the main goal of the NN-VFA is to reduce the manual design effort compared to the PL-VFA.

The experiments compare a PL-VFA to two neural network VFAs: the NN(1,20)-VFA (1 layer, 20 neurons) and the NN(3,20)-VFA (3 layers, 20 neurons per layer). Although a single-layer network theoretically suffices to learn a function, deep neural networks may model the same function with significantly fewer neurons [Delalleau and Bengio2011]. In fact, for many common functions, the required number of neurons decreases exponentially with the number of layers [Lin et al.2017]. [Rolnick and Tegmark2018] suggest that, for many functions encountered in practical settings, relatively small networks suffice to accurately describe functions. Downsides of deeper networks are the longer training time and potential loss of information [Huang et al.2016].

The experimental design is as follows. First, we compare convergence properties of VFAs. Second, we perform experiments on various neural network configurations and learning rates, giving insight into the behavior and robustness of the NN-VFA under varying conditions. Third, we report the computational times corresponding to various VFA configurations. Fourth, we evaluate the performance (i.e., the direct rewards) of the tested VFAs. We discuss offline performances over time, fixing the policy after every 10,000 training iterations, which is valuable when computational budgets are limited. We perform $N$ training iterations and 10,000 performance iterations per offline policy.

All procedures are coded in C++ and CPLEX 12.8 is used to solve the linear decision problems. The experiments were run on a 64-bit Linux machine with a 4x1.60GHz CPU and 8GB RAM.

### 4.1 Problem definition

This section outlines the transportation problem, which is based on the nomadic trucker problem [Powell et al.2007]. It is characterized by a large discrete action space and a complex optimal policy. Let a strongly connected graph represent a transport network. The vertex set represents the potential origins and destinations of transport jobs. The edge set specifies the undirected connections between vertices. Each edge has travel time 1. Edge lengths are distances between vertex pairs, used to compute travel costs. A capacitated agent roams the graph, traveling between directly connected vertices. At each decision epoch, the agent decides (i) which jobs to load, (ii) which jobs to unload, and (iii) which vertex to visit next (or whether to stay at the current vertex).

We sketch the corresponding MDP. The problem state contains the information necessary for decision-making, namely the relevant properties of all transport jobs in the graph and the current location of the agent. Each job is defined by four properties, namely (i) the vertex at which the job is currently located, (ii) the destination vertex, (iii) the time remaining until the due date, and (iv) the assignment status (indicating whether the job is currently carried by the agent). Each unique combination of properties constitutes a job type, and we track the number of jobs per type. For the full system, these counts are collected in a vector. The problem state is given by this vector together with the location of the agent; all possible states together form the state space.

We proceed to describe the action. Consider the set containing both the current location of the agent and the vertices adjacent to it. One decision variable describes the next destination of the agent, selected from this set. Furthermore, two sets of decision variables indicate which jobs are unloaded and which jobs are loaded. The action is defined by these three components. The action space is subject to various constraints; due to space limitations the full LP model is omitted. The key constraints are that (i) the agent may only (un)load at its current location, (ii) jobs are always unloaded when at their destination vertex, and (iii) the agent’s transport capacity may not be exceeded.

Next, we describe the reward function $R(S, x)$. The rewards consist of the following components: (i) a fixed reward for each successful delivery and (ii) a reward for bringing a job closer to its destination, proportional to the reduction in shortest path distance (an increase in distance yields a negative reward). We proceed to discuss the costs: (i) a fixed cost per distance unit covered (i.e., a cost associated with each edge), independent of the number of jobs carried, (ii) a fixed cost associated with each job that is (un)loaded, and (iii) a penalty for violating due dates. Jobs may be voluntarily unloaded by the agent or forced to be unloaded when the due date has been reached. The reward components are linear with respect to jobs and distances, so as not to give NN-VFAs an unfair advantage.

Feature design reflects the components of the reward function. The features are low-level and expressed by linear equations based on . We define the following features: a bias scalar, the number of jobs carried by the agent, the location of the agent, the number of jobs per vertex in , the total time slack per neighboring vertex, and the most likely vertex to visit after visiting the neighboring vertex (given the shortest path of each job). The total number of features is .

For the experiments, we use an instance with , a maximum degree of 3, and up to 5 new jobs generated per vertex at each epoch, with accumulation possible up to 45 jobs. The agent may carry up to 20 jobs. The action space grows exponentially with the number of jobs, rendering enumeration infeasible even for this modest instance. An upper bound for the size of is .

## 5 Numerical results

This section discusses the results of the experiments. We start with the convergence results. Preliminary experiments on simplified problem settings with trivial policies indicate that all VFAs work correctly, converging to the true optimal value function. Figure 1 shows a convergence example for the real problem instance. The PL-VFA converges fastest, but to considerably lower values than the NN-VFAs. Similarly, the NN(1,20)-VFA converges faster than the NN(3,20)-VFA, but to somewhat lower values.

The next experiment addresses learning rates, testing . Figure 2 illustrates the convergence speeds per learning rate for the NN(1,20)-VFA; to aid the visual representation, we omit the other VFAs (which display comparable behavior). However, the NN(3,20)-VFA with did not converge to a stable policy. In general, we find that deeper neural networks are less robust with respect to larger learning rates. Errors may be magnified when passing through multiple layers, returning extreme values. Furthermore, NN-VFAs with do not converge within 100,000 iterations. We therefore use onwards.

Next, we look at the effects of altering network configurations, varying the number of neurons per layer. The results are shown in Table 1. The results are fairly robust, with the exception of the NN(1,10)-VFA, which performs comparatively poorly. Balancing performance and speed, we use 20 neurons per layer for the remainder of the experiments.

We assess the computational time per iteration; roughly 99% of the computational budget is allocated to solving the LPs. On average, the polynomial VFA is solved in 0.02 per iteration, the NN(1,20)-VFA takes 0.16, and the NN(3,20)-VFA takes 0.39. Due to the additional sets of variables and constraints, NN-VFAs are inherently slower to compute than the PL-VFA. Table 2 shows the times for other network configurations also; both adding layers and neurons significantly increases the computational effort.
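To see why deeper and wider networks slow down the decision problem, it helps to count what the embedded network adds to the model. The sketch below assumes one binary variable per ReLU neuron (as in a big-M encoding), fully connected layers, and no separate bias terms; the feature count of 10 is purely hypothetical, and the function name is not from the paper.

```python
def nn_vfa_size(num_features, hidden_layers, neurons_per_layer):
    """Count binaries and weights of a fully connected ReLU network with one output."""
    # One binary variable per hidden (ReLU) neuron under a big-M encoding.
    binary_variables = hidden_layers * neurons_per_layer
    # Input layer to first hidden layer.
    weights = num_features * neurons_per_layer
    # Connections between consecutive hidden layers.
    weights += (hidden_layers - 1) * neurons_per_layer ** 2
    # Last hidden layer to the single output node.
    weights += neurons_per_layer
    return {"binary_variables": binary_variables, "weights": weights}


if __name__ == "__main__":
    for layers in (1, 3):
        print(f"NN({layers},20):", nn_vfa_size(10, layers, 20))
```

Under these assumptions the NN(3,20)-VFA carries three times as many binary variables as the NN(1,20)-VFA, which is in line with the direction of the reported solution times.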

To conclude, we reflect on the qualities of the NN-VFA policies. An example of offline performances, measured after every 10,000 training iterations, is shown in Figure 3. This example illustrates that from 20,000 iterations onwards, the PL-VFA is considerably outperformed. In general, we noted that the PL-VFA rather quickly results in a stable, but often clearly suboptimal, policy. Furthermore, the NN(3,20)-VFA performs better than the NN(1,20)-VFA and is more stable over time.

Figure 3: Example of offline policy performance $R(\cdot)$ for various $\bar{V}^n$, $n \in \{0, N\}$

Table 3 shows the average policy performance over repeated replications, measured after completing $N$ training iterations. The NN(1,20)-VFA outperforms the PL-VFA by 10.1% and the NN(3,20)-VFA does so by 19.6%. Although both NN-VFAs achieve comparable policy performance at times, the NN(1,20)-VFA is more prone to fluctuating performances, which is also indicated by its higher standard deviation. Furthermore, the NN(3,20)-VFA simply has more expressive power. The results demonstrate that the NN-VFAs significantly outperform the PL-VFA for our transportation problem.

## 6 Conclusions

This paper introduces the integration of linear programs and value function approximations in the form of neural networks, geared towards solving high-dimensional and combinatorial problems in operations research. Our proposed hybrid method is rooted in the framework of approximate dynamic programming. Traditionally, large action spaces in OR problems are handled by formulating the decision problem as a linear program, yet it is difficult to properly define polynomial VFAs in this context.

The main contribution of the NN-VFA is the reduced effort of manual feature design, which is a crucial and precarious step in all solutions relying on VFAs. Unlike PL-VFAs, the NN-VFA is able to learn higher-order effects of simple input features without these effects being explicitly designed. This is particularly relevant when embedding VFAs in linear programs, in which the design of nonlinear features may be a cumbersome task.

We test our solution method on a representative transportation problem with a large discrete action space, a complex optimal policy, and a multi-component reward function. We compare NN-VFAs to the traditional PL-VFA, keeping all other factors equal. We observe significant improvements in performance. The findings are also robust with respect to neural network configurations; with various settings for training iterations, learning rates, neurons, and layers, the PL-VFA is consistently outperformed. NN-VFAs with multiple hidden layers yield the best and most stable policies, but also require more iterations to converge and more computational effort per iteration. We emphasize that this paper is an exploration of integrating LPs and NN-VFAs; additional research on different problems is needed to draw more general conclusions about the NN-VFA. In our opinion, the obtained results warrant such further studies.

## Acknowledgments

This work is part of the research program Scalable Interoperability in Information Systems for Agile Supply Chains (SIISASC) with project number 438-13-603, which is partially funded by the Netherlands Organization for Scientific Research (NWO).

## References

• [Bertsekas and Tsitsiklis1995] Dimitri Bertsekas and John Tsitsiklis. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pages 560–564. IEEE Publ. Piscataway, NJ, 1995.
• [Bertsekas2008] Dimitri Bertsekas. Neuro-dynamic programming. In Encyclopedia of optimization, pages 2555–2560. Springer, 2008.
• [Bunel et al.2018] Rudy R Bunel, Ilker Turkaslan, Philip Torr, Pushmeet Kohli, and Pawan K Mudigonda. A unified view of piecewise linear neural network verification. In Advances in Neural Information Processing Systems, pages 4791–4800, 2018.
• [Cybenko1989] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
• [De Farias and Van Roy2003] Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
• [Delalleau and Bengio2011] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.
• [Dulac-Arnold et al.2012] Gabriel Dulac-Arnold, Ludovic Denoyer, Philippe Preux, and Patrick Gallinari. Fast reinforcement learning with large action sets using error-correcting output codes for MDP factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 180–194. Springer, 2012.
• [Gurney2014] Kevin Gurney. An introduction to neural networks. CRC press, 2014.
• [Haykin2009] Simon Haykin. Neural networks and learning machines, volume 3. Pearson Upper Saddle River, 2009.
• [He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
• [Huang et al.2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
• [Lin et al.2017] Henry Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
• [Pazis and Parr2011] Jason Pazis and Ron Parr. Generalized value functions for large action sets. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1185–1192, 2011.
• [Pérez Rivera and Mes2017] Arturo Pérez Rivera and Martijn Mes. Anticipatory freight selection in intermodal long-haul round-trips. Transportation Research Part E: Logistics and Transportation Review, 105:176–194, 2017.
• [Powell et al.2007] Warren B Powell, Belgacem Bouzaiene-Ayari, and Hugo P Simao. Dynamic models for freight transportation. Handbooks in operations research and management science, 14:285–365, 2007.
• [Powell2011] Warren Powell. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 2. John Wiley & Sons, 2011.
• [Powell2016] Warren Powell. Perspectives of approximate dynamic programming. Annals of Operations Research, 241(1-2):319–356, 2016.
• [Rolnick and Tegmark2018] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In International Conference on Learning Representations, 2018.
• [Schmidhuber2015] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
• [Sutton and Barto2018] Richard Sutton and Andrew Barto. Reinforcement learning: An introduction. MIT press, 2018.
• [Van Heeswijk and La Poutré2018] Wouter Van Heeswijk and Han La Poutré. Scalability and performance of decentralized planning in flexible transport networks. In 2018 IEEE International Conference on Systems, Man, and Cybernetics, pages 292–297. IEEE, 2018.
• [Van Heeswijk et al.2019] Wouter Van Heeswijk, Martijn Mes, and Marco Schutten. The delivery dispatching problem with time windows for urban consolidation centers. Transportation Science, 53(1):203–221, 2019.
• [Wilmanski et al.2016] Michael Wilmanski, Chris Kreucher, and Jim Lauer. Modern approaches in deep learning for SAR ATR. In Algorithms for synthetic aperture radar imagery XXIII, volume 9843. International Society for Optics and Photonics, 2016.