Approximate policy iteration using neural networks for storage problems

by   Trivikram Dokka, et al.
Loughborough University

We consider the stochastic single node energy storage problem (SNES) and revisit Approximate Policy Iteration (API) to solve SNES. We show that the performance of API can be boosted by using neural networks as an approximation architecture at the policy evaluation stage. To achieve this, we use a model different to that in literature with aggregate variables reducing the dimensionality of the decision vector, which in turn makes it viable to use neural network predictions in the policy improvement stage. We show that performance improvement by neural networks is even more significant in the case when charging efficiency of storage systems is low.



1 Introduction

The growing prevalence of the smart grid, the next-generation power grid that uses advanced technologies, equipment and controls to deliver electricity more reliably and efficiently, has caused a rapid change in the generation and consumption of electricity. In addition, the drive to reduce emissions from the electricity supply has increased the integration of renewable energy sources such as wind and solar into electricity generation. This is evidenced by [21]: a quarter of the total electricity generated in the UK in 2017 was provided by renewable energy sources, compared to 5% in 2016. However, due to their high volatility and intermittency, renewable energy sources are often paired with energy storage devices like batteries to increase the value of the energy generated and optimize existing grid connections. Battery storage provides many benefits such as peak shaving, time shifting, electricity price arbitrage, balancing costs, provision of operating reserve and reduction of curtailment [4, 5, 36]. The combination of smart metering systems, smart grids, decentralised renewable generation and energy storage systems gives end-users the opportunity to generate electricity as well as monitor and control their consumption in order to reduce their electricity bills. To realize the full potential of integrating renewable energy sources with battery storage systems, the best possible energy allocation decisions must be determined by optimizing the control policy. In other words, one should be able to decide when to store, release, buy and sell energy in order to maximize profit.
Therefore, researchers from fields such as Operations Research, Computer Science and Electrical Engineering have been paying more attention to proposing solutions and recommendations for the problem of optimally managing energy from different sources such as renewable energy connected to the grid, batteries, households, consumers, prosumers or some other form of energy sink [6, 12, 15, 33, 35]. In this work we consider the single-node energy storage (SNES) problem as defined by [7]. Informally, SNES concerns a single energy-producing node in the smart grid that wishes to maximize its own objective without taking into account the goals of the system operator.

The problem is connected to classical inventory optimization problems. Optimization with storage decisions appears in other contexts such as commodity storage and currency trading. Even though all these problems can be grouped into a similar class, each of them has its own unique characteristics which affect how hard the problem is to solve. SNES can be modeled as a stochastic dynamic program and many studies in the literature (see Section 2) propose a variety of approaches to solve the resulting hard dynamic programs. Among these are the popular approximate dynamic programming (ADP) approaches. One such technique within ADP is approximate policy iteration (API), which is often used to solve large stochastic dynamic programs. There are many variants of API which prove useful depending on the specific problem being solved. In this work we investigate a novel API algorithm which employs neural networks to approximately solve the SNES problem. To the best of our knowledge, only two papers [9, 14] combine neural networks with API, and in both the approach differs from ours in terms of neural network structure and implementation. [9] proposes a neural network based approximation scheme which does not take the ADP approach.

While SNES and other storage problems have been studied rather industriously in the last few years, rapidly evolving algorithmic technologies and the changing nature of the problem (for example, storage is rented instead of owned, which partially removes the constrained storage bottleneck but introduces some cost implications) mean that algorithm performance can now be different, and perhaps better, than previously understood. This makes a case for re-investigating known algorithms to boost their performance by employing the new armory of technologies such as advanced machine learning algorithms. The scope of this paper is to revisit API and explore ways to boost its performance through model simplification and powerful frameworks like neural networks.

The rest of the paper is organized as follows. In Section 2 we review the existing literature on the problem. In Section 3 we define an example mathematical model from the literature and our proposed model, and explain the differences between them. In Section 4 we give insights into the structure of the optimal policy. In Section 5 we describe the deterministic version of the problem and give an integer programming formulation. In Section 6 we describe our algorithm. In Section 7 we describe our experimental setup, including the data used, benchmarks and metrics, and discuss our findings from numerical experiments. Lastly, we conclude in Section 8.

2 Literature review

The energy storage problem is strongly linked to inventory optimization problems, as both deal with deciding how to meet demand from supply sources. Demand and supply may each be deterministic or stochastic, and both cases have been well studied in inventory optimization theory [23, 37]. Also, due to the possibility of trading energy with the grid, this problem can be related to research in the commodity trading literature such as [25, 28]. In particular, Secomandi [28] focuses on the commercial management of a commodity storage asset and on determining the optimal inventory-trading policy under capacity constraints and stochastic spot prices.

The current literature on the single-node energy storage problem is mostly concerned with the formulation, numerical solution and analysis of the mathematical models that have been proposed. In a setting similar to the one discussed in this paper, [36] investigate the management of a merchant wind farm co-located with a grid-level storage facility and connected to the market via a transmission line. They formulate the problem as a finite-horizon Markov decision process. However, in contrast to the majority of research conducted on this problem, they consider negative electricity prices, which occur in most deregulated markets. Hence, they also propose some heuristics and assess their performance against the optimal policy. Most papers also assume that the price of buying and selling energy on the spot market is the same [8, 12, 26]. According to [8], this assumption makes the stochastic version of the SNES problem solvable in polynomial time by dynamic programming. However, when different buying and selling prices are used in the model, the stochastic version of the problem becomes #P-hard [8]. This may be the reason why most models in the literature use the same buying and selling price. [8] also show that the deterministic case of the single-node energy storage problem can be solved in strongly polynomial time.

[12] consider the use of Approximate dynamic programming (ADP) since the implementation of backward dynamic programming, especially for large-scale problems, can quickly become intractable and computationally intensive. [20] explore algorithms based on ADP and numerically compare their performance against the Lyapunov optimization-based algorithms in terms of the value of storage and renewable energy sources. In the case of infinite horizon, [11] use a stochastic dynamic program formulation to minimize the average cost derived from the installation and management of a storage device integrated with a renewable energy source to meet uncertain demand under dynamic pricing. They also show that the optimal management policy has a dual threshold structure, which is also discussed in [25, 28] under the commodity trading context. [30] consider a rule-based dispatch scheme for the finite-horizon energy storage problem without taking into account the effect of prices or the variability of wind. [16] suggest the need to consider the effect of risk on the optimal policy following the results from the risk analysis they conducted on an optimal deterministic risk-neutral policy and a simple myopic policy.

Other approaches within the broad ADP framework include value function approximations. The ADP algorithm proposed in [19] iteratively constructs piecewise linear and concave value function approximations to help determine the solution of complex storage problems. The authors also prove that their algorithm converges to an optimal policy by learning the optimal value functions for important regions of the state space as selected by the algorithm. [18] extends the algorithm and results of [19] to problems where the decision vector is a potentially high-dimensional continuous vector. Similar to [19, 18], [26] uses the concavity of the value function approximations to speed up the convergence of their proposed finite-horizon ADP algorithm. Their algorithm also designs near-optimal time-dependent control policies for energy storage problems involving multiple storage devices.

[24] propose the Stochastic Multiscale model for the Analysis of energy Resources, Technology, and policy (SMART), an algorithmic strategy based on the ADP framework for modeling long-term investment decisions and economic analyses of portfolios of energy technologies in the presence of uncertainty. ADP also includes (approximate) value iteration (AVI) methods, which require a lookup table representation of the state space [12]. From [12, 13, 17], one can deduce that pure lookup table AVI performs poorly in practice due to a very slow convergence rate, despite the existence of convergence theory. However, results from [12] show that structured lookup table AVI outperforms other more general approaches such as API paired with a generic approximation technique. Nevertheless, it is limited to low-dimensional state-of-the-world variables or moderately sized problems. Therefore, [10] approximates the value functions by employing Dirichlet process mixture models, which scale well to large state spaces. The mixture models are used to cluster the states so that convex value functions can be fit within each cluster.


[12] compare the performance of various approximation architectures implemented with ADP approaches such as API, AVI and direct policy search on benchmark instances of the finite-horizon energy storage problem. They use popular nonparametric estimators like Support Vector Regression (SVR), Gaussian Process Regression (GPR), Local Polynomial Regression (LPR) and Dirichlet Cloud-Radial Basis Function (DCR) during the policy evaluation phase of the API algorithm. SVR performs far better with API than the other three approximation techniques considered in [12]. In [14], neural networks are used to approximate the performance index needed to analyze the convergence and stability properties of their proposed policy iteration ADP method for solving the infinite horizon optimal control problem for nonlinear systems. [27] focus on the use of a parametric linear model with pre-specified basis functions, least-squares temporal difference (LSTD) and Bellman error minimization with approximate policy iteration to solve the same energy allocation problems considered by [12]. A similar approach is used by [15] to find the optimal infinite horizon storage and bidding strategy in the day-ahead market for a renewable power generation and energy storage system.

3 Model

We now introduce the problem and model parameters formally, assuming a single renewable source and a single battery. The problem has the following parameters:

  • T: number of time periods (T+1 is the terminal storage)

  • : storage rent, in per MWh per time step.

  • and : charging and discharging inefficiencies of the battery

  • , : maximum charging and discharging rates of the device

The exogenous information is represented by the following random variables:

  • : amount of energy produced by the renewable source at time

  • : energy demand of the household (or some other type of energy sink) at time

  • : buying price of electricity at time

  • : selling price of electricity at time

The exogenous information is given by the vector . The set of possible realizations of is denoted by . Following [7] and [24], we make the following assumptions: the stochastic process

is finite discrete-time Markov; the support and transition probabilities are given;

is measurable and finally .

We will first give the model often used in literature which we refer to as the flow model. Identifying Grid by letter G, battery by letter R, demand by D, energy source by E; the decision variables at each time are the flow variables, given by , where indicates energy flow from device to device at time . Denoting the amount of energy in the battery at time by we have the following transition equation:


The decision vector should satisfy the following constraints:


The single period profit function is given by


The goal is to find a policy which maps the states to the decisions that maximizes profit at each time.

Most of the literature uses the flow model with minor variations owing to different assumptions, and proposes algorithms, exact or heuristic, to solve this problem. Our contribution in this work differs from these in the following ways:

  1. We capture the decisions using much simpler decision variables, mainly inspired by [28], and capture the losses within the objective function rather than as constraints hence deviating from the approach taken by previous studies. This enables us to reduce the dimensionality of the decision space from eight to one. In optimization literature, relaxations are often useful in obtaining very close-to-optimal solutions and also serve as useful indicators of solution quality.

  2. Secondly, our variables are discrete, integer-valued variables rather than continuous ones. The main motivation is that energy is measured in discrete units, and continuous variables are often employed mainly for computational convenience, with rounding applied to the resulting solutions.

We now give our mathematical formulation for the single node energy storage problem and we refer to our model as the aggregate model since our variables may be seen as aggregated flow variables. We are interested in determining a non-negative vector , where indicates the amount of energy sold to grid, purchased from grid and stored respectively in time period .

The decision vector must satisfy the following constraints:


The right hand side of (8) is (demand-supply), that is, net demand. Note that due to our choice of decision variables we have , with this in mind we write the single period profit function as follows:


where is the cost incurred due to loss of energy owing to the inefficiency of the system. In any given period the valuation of the loss depends on whether the decision-maker decides to buy or sell. Clearly, implies buying and selling at the same time is sub-optimal. The inefficiency losses are valued at when buying and at when selling. However, in our experiments we valued all losses at . We highlight here that valuing losses in monetary units rather than in energy terms is plausible, especially when the battery operation is outsourced. Therefore, our model is easily usable in the scenario where the storage is not directly owned by the decision-maker and the decision-maker instead uses a storage-as-a-service model. The loss terms in the aggregate model's profit function also have a more general motivation: they can be seen as a variable cost charged for injection into and withdrawal from the storage, which depends on the amount of energy injected or withdrawn. This is often the case in practical storage problems, hence our model captures a more general scenario of which SNES is a special case.

Observe that the profit function in the aggregate model is non-linear as opposed to the flow model where the profit function is linear. This may seem a bad idea at first sight but it gives us a trade-off by reducing the decision vector dimensionality which can be very useful in the policy improvement stage of a policy iteration algorithm.
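
As a rough illustration of the aggregate profit function described above, the following Python sketch values trading, storage rent and inefficiency losses for one period. All names (`single_period_profit`, `loss_price` and the argument names) are hypothetical, and charging rent on the end-of-period battery level is an assumption made here rather than something stated in the model.

```python
def single_period_profit(buy, sell, storage_change, level, *,
                         buy_price, sell_price, rent,
                         charge_ineff, discharge_ineff, loss_price):
    """Sketch of a single-period profit with losses valued in money.

    storage_change > 0 means injection, < 0 means withdrawal.
    loss_price is the (assumed) price at which lost energy is valued.
    """
    # Revenue from selling minus the cost of buying.
    trade = sell_price * sell - buy_price * buy
    # Assumption: rent is paid on the battery level after the decision.
    holding = rent * (level + storage_change)
    # Losses scale with the amount moved in or out of the battery.
    ineff = charge_ineff if storage_change > 0 else discharge_ineff
    loss = ineff * loss_price * abs(storage_change)
    return trade - holding - loss


if __name__ == "__main__":
    # Buy 3 units at price 2 and inject all of them into an empty battery.
    print(single_period_profit(3, 0, 3, 0, buy_price=2.0, sell_price=1.0,
                               rent=0.0, charge_ineff=0.05,
                               discharge_ineff=0.05, loss_price=2.0))
```

The absolute value in the loss term is what makes the function non-linear in the storage decision, while keeping that decision one-dimensional.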

For the sake of completeness we present the dynamic programming (DP) formulation but we did not attempt to solve the resulting DP. For the ease of exposition let and . Using Equation 8 and this notation, we can eliminate the variables and . The following definition will be useful, . Note that with this notation our decision variable at any given time is single-dimensional which when positive means injection of energy in the battery and negative means withdrawal of energy from the battery.

Following [28], at any decision time, the sets of feasible withdrawal and injection decisions, respectively, with current battery level are defined as and . We denote the set of all feasible actions by . The revenue function is


An energy management policy can be obtained by solving a finite horizon MDP using the following dynamic programming recursion.


The formulation should be interpreted as: in the last stage we can buy or sell depending on the net demand energy but do not inject; in the remaining stages, we have eight actions resulting from the Cartesian product of {sell, buy, neither sell nor buy } and {inject, withdraw, neither inject nor withdraw}. We will now give some structural insights into the optimal policy, in the same vein as discussed in [28].

4 Optimal policy structure

This section analyzes the structure of the optimal policy. Although there are a number of studies in literature focusing on solving or approximately solving the DP we are not aware of any study analyzing the structure of the optimal policy except that of [28]. Our problem is a generalization of the problem studied in [28] in two ways:

  1. in our setting buying and selling happen at different prices, and

  2. in each period we have (stochastic) demand and (stochastic) production (wind).

In [28], the author shows that when , are (much) less than the maximum battery capacity the optimal action at each time period depends not only on the spot price but also on the initial battery level. This is a very important insight because it results in the optimal action space having a specific structure split into three phases (inject, do-nothing, withdraw) depending on the initial battery level, hence departing from this structure and employing sub-optimal action can result in very low payoff.

Observation 1.

When buying and selling prices are equal the optimal action in a period only depends on the initial battery level, prices and injection/withdrawal rates but not on the demand and wind profiles.


The statement means that at fixed prices, battery level and rates, changing the wind and demand levels in each period will not shift the optimal action from injection to withdrawal and vice-versa. To see this note that if injection is optimal in the current period for a given demand and wind profiles, the injection may be done for two reasons: satisfy future demand and selling in the future. Since buying and selling prices are equal, storing in the current period is equivalent to buying in respect of not earning revenue in that period. Therefore, irrespective of demand it is optimal to store as long as prices remain the same. ∎

This aligns with the results in [8] that the case with equal buying and selling prices is easy compared to the case otherwise. On the other hand when buying and selling prices are not equal different demand patterns can result in different optimal strategies.

Observation 2.

When buying and selling happen at different prices optimal actions also depend on wind and demand profiles.


Consider for example the price profile given in Figure 1, where dotted lines indicate selling prices and thick lines indicate buying prices. Clearly, it is not optimal to buy in period one in order to sell in period two or three. Therefore, injecting is optimal at any initial battery level only if net demand in period three is positive, in which case, depending on injection and withdrawal rates, it may be optimal to buy and store in period one to satisfy demand in period three. On the other hand, if net demand is non-positive in period three it may not be optimal to buy and inject in period one. This illustrates the complexity of our problem compared with [28]. ∎

Figure 1: Example price profile for three periods

4.1 Negative surplus

In the case of negative surplus, that is when wind plus energy in the battery is less than demand the optimal policy structure is similar to that in [28] with three phases: buy-and-inject, buy, buy-and-withdraw. As indicated in Figure 2, depending on the initial battery level and the buying price, the optimal decision varies. We highlight here, as noted in [28], the injection and withdrawal rates play a key role in the optimal action.

Figure 2: Illustration of the optimal policy structure for a given stage and prices in the case of negative surplus

4.2 Positive surplus

In contrast to the negative surplus case, the structure of the optimal policy in any given period with positive surplus is much more complicated and depends, at a given time and for given buying and selling prices, not only on the inventory but also on the total net surplus, as shown in Figure 3.

Figure 3: Illustration of the optimal policy structure for a given stage and prices in the case of positive surplus

Before going to discuss API for the SNES problem, we will briefly describe the deterministic version of the problem and give the Integer program (IP) formulation.

5 Deterministic case

The deterministic version of SNES, that is, when the demand, wind and prices in each period are known, is a special case of the lot sizing problem with losses, bounded inventory and constrained injection/withdrawal rates. The lot sizing problem, including its deterministic and stochastic variants, is very well studied in the operations research literature, see [22]. However, we are not aware of a reference which studies the variant with constrained injection/withdrawal rates. [8] gave a reduction of SNES to the minimum-cost flow problem, proving its polynomial solvability. It appears that a similar reduction can be used to solve the SNES problem modeled with aggregate variables. Since this is not our main objective, we leave it for future work and give a simple integer programming formulation for deterministic SNES: Let

; . We have the following variables for every period :

total withdrawal from battery when is one of {buy-withdraw, sell-withdraw, withdraw}
total units stored () in period
total units sold () in period
total units bought () in period
max (20)
s.t. (21)
Lemma 3.

LS-L-BI is a valid IP formulation of deterministic SNES.


(21) ensures exactly one type of decision is taken from set , (22) and (23) make sure that injection and withdrawal limits are respected in each time period, (24)-(26) make sure that exactly one of injection or withdrawal variables is positive and negative sign in objective function implies (25) and (26) will be tight in an optimal solution. Finally (27) is a balancing constraint which ensures feasibility. ∎

We use (20)-(29) in solving deterministic SNES in numerical experiments in evaluating API policies. As a last remark in this section, we note that we can formulate deterministic SNES more compactly without having variables and deduce the decision from and variables. However, the above integer programme gives this information as its output. Since the instances we solve are very small in size the computation times using the above IP are still very small making it feasible.

6 Approximate policy iteration with neural networks

Policy iteration (PI) is a well-known algorithmic technique used to solve stochastic dynamic programs (DP). Several results exist on the convergence of PI for DPs, see for example [3, 1]. PI has two main steps, evaluation and improvement. A policy is a mapping from a state space to a decision space. A policy is feasible if it satisfies all constraints. The idea is to start with a feasible policy and iteratively improve it after evaluating it at each iteration. It has been shown that, under certain mild assumptions, the policy converges to the optimal policy. Exact policy iteration requires evaluating the entire state space multiple, if not many, times, which is an almost impossible task even for moderately large state spaces. This has led many researchers to focus on approximate ways of implementing policy iteration while still maintaining convergence. For a survey of API see [2]. Simulation is often used instead of full state space evaluation. This is the approach we take in this paper. Instead of evaluating a policy on the full state space, we use Monte Carlo simulation to generate a fixed number of sample states and then learn the value function using a machine learning model which takes as input the current state and decisions and predicts the future payoff. Some studies refer to the machine learning framework used to learn an approximate value function as an approximation architecture. Many approximation schemes have been proposed in the literature. Most of these use (some sort of) linear approximation architecture, in which the value function is expressed as a linear function of a known set of features, sometimes also called basis functions. The policy evaluation data from the simulation is then used as input to fit coefficients for these features; for example, see [12, 13] and references therein. When the features are not already known, learning algorithms like neural networks can be used to learn them. This is the idea proposed in [3].
In this paper we do not restrict ourselves to linear architectures for value function approximation. Instead, we use neural networks, which are capable of learning highly non-linear functions. We give a formal description of our algorithm in Algorithm 1, where in Step 5 stands for neural networks.

Using machine learning within API is not new and has already been explored, owing to the fact that learning high-dimensional functions is central to machine learning theory. In [12] several machine learning techniques like support vector machines are used for Step 5. The original ideas of using neural networks within approximate policy evaluation were proposed in the reinforcement learning literature many years ago, see for example [31]. More recently, [3] proposed API with neural nets to approximate the cost function of policy evaluation using feature based approximation, where linearly linked features are learned using neural networks. Our work is based on very similar ideas, but we do not use any feature based combination to approximate the value function. Instead, we use deep neural networks as black box models to predict function values. As [3] points out, using neural networks within API without assuming a linear combination of features requires the development of models which enable dimensionality reduction. In our model we achieve this by valuing injection and withdrawal losses in monetary terms and moving them to the objective function. This enables us to express the decision vector in just one dimension, that is, we have just one variable which measures how much energy is stored in the battery in each period. For a fixed value of storage in a time period, , the unique values for buying and selling decisions can be easily derived. We formally state this in the following Lemma. Before stating the Lemma we give the following observations.

Observation 4.

In every time period, only one of the buying and selling decisions is optimal.


The statement has to be true due to the assumption . ∎

Observation 5.

It is not optimal to both inject and withdraw from the battery in the same time period.

Lemma 6.

Given Algorithm 2 computes the optimal values of and .


Trivial. ∎

Step 0: Set initial policy , set
Step 1: Set
Step 2: Select initial battery level
for  to  do
   Step 3a: sample
   Step 3b: evaluation Apply policy:
end for
Step 4: if , and return to Step 1.
Step 5: Approximate value function:
Step 6: Improvement:
Step 7: if , goto Step 1
Algorithm 1 Approximate Policy Iteration with Neural Networks (APINN)

With just one decision variable, the policy improvement stage is much simpler: select the best decision by enumerating over all decisions given a state. The main drawback of using non-linear approximations is removed because the policy improvement stage is now a single dimensional optimization rather than a non-linear multi-dimensional problem which is hard to handle.

A full policy improvement which updates policy for each possible state can still be very time consuming and may even be unnecessary. We take a simulation approach even in the improvement stage. That is, we generate samples of exogenous information and compute optimal decisions for each possible initial battery level.
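
The evaluation and improvement loop can be sketched on a toy instance as follows. This is a minimal sketch under stated assumptions, not Algorithm 1 itself: a table of Monte Carlo averages stands in for the neural network of Step 5, rollouts are started from every period and battery level rather than from sampled initial levels, and the horizon, prices, net demand distribution and all function names are invented for illustration.

```python
import random

HORIZON, CAPACITY, RATE = 6, 6, 3        # hypothetical toy instance
BUY = [2.0, 8.0, 2.0, 8.0, 2.0, 8.0]     # hypothetical buying price per period
SELL = [p - 1.0 for p in BUY]            # selling price below buying price
RENT, LOSS = 0.0005, 0.05                # storage rent and inefficiency


def feasible(level):
    """Integer storage changes allowed by the rate and capacity limits."""
    return range(-min(level, RATE), min(CAPACITY - level, RATE) + 1)


def profit(t, level, net_demand, a):
    """Single-period profit for storage change a (a > 0 injects)."""
    traded = net_demand + a              # > 0 is bought, < 0 is sold
    cash = -BUY[t] * traded if traded > 0 else SELL[t] * (-traded)
    # Assumption: losses are valued at the buying price.
    return cash - RENT * (level + a) - LOSS * BUY[t] * abs(a)


def evaluate(policy, rng, rollouts):
    """Monte Carlo evaluation: value[t][level] estimates the profit-to-go."""
    value = [[0.0] * (CAPACITY + 1) for _ in range(HORIZON + 1)]
    for t0 in range(HORIZON):
        for level0 in range(CAPACITY + 1):
            total = 0.0
            for _ in range(rollouts):
                level = level0
                for t in range(t0, HORIZON):
                    nd = rng.choice((-1, 0, 1))   # sampled net demand
                    a = policy(t, level, nd)
                    total += profit(t, level, nd, a)
                    level += a
            value[t0][level0] = total / rollouts
    return value


def greedy(value):
    """Improvement: enumerate the one-dimensional decision and pick the best."""
    def policy(t, level, nd):
        return max(feasible(level),
                   key=lambda a: profit(t, level, nd, a) + value[t + 1][level + a])
    return policy


def do_nothing(t, level, nd):
    return 0


def run_api(iterations=3, seed=0, rollouts=100):
    """Return estimated values of (initial policy, final policy) from an empty battery."""
    rng = random.Random(seed)
    base = evaluate(do_nothing, rng, rollouts)
    value = base
    for _ in range(iterations):
        policy = greedy(value)                   # policy improvement
        value = evaluate(policy, rng, rollouts)  # policy evaluation
    return base[0][0], value[0][0]


if __name__ == "__main__":
    print(run_api())
```

Because the decision is a single integer, the improvement step is a plain enumeration over at most `2 * RATE + 1` values, whatever predictor supplies the future value.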

In our experiments we choose , , and . We implemented step 2 for initial battery levels 0 to 10. This in total gives an input data size of 300000 for neural networks in step 5.

Input: , , ,
if  then
   if  then
      if  then
      end if
      if  then
      end if
   end if
   if  then
      if  then
      end if
      if  then
      end if
   end if
   if  then
      if  then
      end if
      if  then
      end if
   end if
   if  then
   end if
   if  then
   end if
end if
Algorithm 2 Compute-decisions
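
The idea behind Lemma 6 can be illustrated with a short Python sketch: once the storage decision is fixed, the energy balance and the fact that buying and selling never happen together determine the remaining decisions. This is not a reconstruction of Algorithm 2. The function name and arguments are hypothetical, and the balance is assumed to be "bought minus sold equals net demand plus storage change", with losses handled in the objective only.

```python
from typing import Tuple


def compute_decisions(net_demand: int, storage_change: int) -> Tuple[int, int]:
    """Return (buy, sell) implied by a fixed storage change.

    net_demand is demand minus renewable supply; storage_change > 0
    injects into the battery and storage_change < 0 withdraws from it.
    """
    # Energy that must come from the grid (if positive) or go to it (if negative).
    required = net_demand + storage_change
    buy = max(required, 0)
    sell = max(-required, 0)
    return buy, sell


if __name__ == "__main__":
    print(compute_decisions(4, 2))    # demand exceeds supply and we also inject
    print(compute_decisions(-5, 2))   # surplus partly stored, remainder sold
    print(compute_decisions(3, -3))   # net demand met entirely from the battery
```

At most one of the two returned values is positive, in line with Observation 4.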

7 Experimentation

We now discuss the details of our numerical experiments. This section is organized as follows: first we explain the data generated for our experiments, then we discuss the parameters chosen for the neural network and for API, and finally we present our numerical results.

7.1 Data

7.1.1 Exogenous information process

The exogenous information matrix from which we sample to determine the next exogenous state for our proposed algorithm was generated using the stochastic processes and benchmark probability distributions (particularly, the Discrete Uniform Distribution and the Discrete Pseudonormal Distribution) described in [12, 26]. We skip the exact details of these distributions and refer the reader to [12, 26].

Demand Sampling Process

The minimum and maximum demand were set to 1 and 15 respectively. Following [12], the demand is sampled to have a seasonal structure (that usually exists in observed energy demand):

is pseudonormally distributed, and discretized over the set {0, , }. The demand for the next time period is then selected using,


Renewable Energy Source Sampling Process

The sampling process for the stochastic renewable source used the

-order Markov Chain model, with

= 1 and = 7. The values between and were then discretized at a level of = 1 and used as support for the sampling process. Therefore, the renewable energy for the next time period is given by,


are independent and identically distributed random variables that are either uniformly distributed over the set {0, } or pseudonormally distributed, (0, ) and discretized over the set {0, , , …, }.
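As a rough illustration of this kind of sampling process, the sketch below draws noise from a discretized pseudonormal distribution and applies it in a first-order chain. It rests on two assumptions that should be checked against [12, 26]: that the pseudonormal distribution assigns each support point a probability proportional to the normal density at that point, and that the next value is the current value plus noise, clipped to the bounds. The bounds 1 and 7 are taken from the text; the noise support and standard deviation are hypothetical.

```python
import math
import random


def pseudonormal_weights(support, mu, sigma):
    """Normal density evaluated on a discrete support, normalised to sum to 1."""
    dens = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in support]
    total = sum(dens)
    return [d / total for d in dens]


def next_level(current, noise_support, noise_weights, lo, hi, rng):
    """One step of a clipped first-order chain: add a noise draw, clip to [lo, hi]."""
    eps = rng.choices(noise_support, weights=noise_weights, k=1)[0]
    return min(max(current + eps, lo), hi)


rng = random.Random(42)
support = [-2, -1, 0, 1, 2]                      # hypothetical noise support
weights = pseudonormal_weights(support, mu=0.0, sigma=1.0)

path = [4]                                       # hypothetical starting value
for _ in range(24):
    path.append(next_level(path[-1], support, weights, lo=1, hi=7, rng=rng))
```

Replacing `weights` with equal weights gives the discrete uniform variant under the same assumptions.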

Prices Sampling Process

Since we have two types of prices: Buying Price () and Selling Price (), we defined four parameters; = 3, = 13, = 2 and = 12, to support the price sampling processes. Two of the three price processes described in [12] were also considered using the pseudonormal distribution defined in [26]. The same distribution parameters are used for both price types. Hence the formulas described below apply to and even though they have only been defined in terms of .
The first method considers a -order Markov Chain Process with and discretized over {0, , , …, }.


The second sampling process referred to as the Markov Chain with jumps includes the simulation of price spikes. In this case, and discretized over {0, , , …, }, and


7.1.2 Battery storage parameters

The minimum and maximum battery storage level, and are set at 0 and 30 respectively. Therefore, can take any value between discretized at a level of = 1. We set the storage rent (), injection rate () as well as the withdrawal rate () as follows: = 0.0005, and = 3. Recall that in our model the efficiency losses are accounted for in the objective function using a penalty term. In fact, the loss terms make the profit function non-linear. The larger the inefficiency, the larger the effect of these terms on the profit function, which in turn calls for a more specialized machine learning technique for approximating the value function in Step 5 of Algorithm 1. To take this into account, we experiment with two battery efficiency scenarios: scenario high - = = 0.05; scenario low - = = 0.3.

Remark 7.

The choice of bounds for , and is driven by the computational times required to train API for 10 policy improvements.

7.2 Approximation architectures

In order to compare neural networks (NN) with other approximation architectures, we chose Multiple Linear Regression (LR) and Support Vector Regression (SVR). Recall that these approximation architectures are used to predict the future contribution given a set of 5 inputs: the time period, the previous energy storage amount and the decisions for that time period (energy bought, energy sold and energy stored) obtained from the approximate policy evaluation phase. We provide a brief description of how these architectures were implemented in Python.

Multiple Linear Regression (LR)

LR is often used to predict the target (dependent) variable as a weighted sum of the input (independent) variables, since it makes it easy to estimate, understand and explain the relationship between the dependent and independent variables.


We use the ordinary least squares Linear Regression model from the scikit-learn library to find the coefficient values that minimize the sum of squared differences between the actual and the estimated values of the dependent variable.
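The least-squares fit can be sketched with NumPy on synthetic data. This is not the experimental code: the data, the coefficient values and the variable names are assumptions made for illustration, and `numpy.linalg.lstsq` is used here in place of scikit-learn's `LinearRegression`, which solves the same ordinary least squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic inputs (assumption): time period, previous storage level,
# energy bought, energy sold, energy stored.
X = rng.integers(0, 10, size=(n, 5)).astype(float)

# Hypothetical "true" relationship used only to generate targets.
true_w = np.array([0.5, 1.0, -2.0, 3.0, 0.25])
true_b = 4.0
y = X @ true_w + true_b

# Append a column of ones so the intercept is estimated with the weights.
A = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
weights, intercept = coef[:-1], coef[-1]

# Prediction for one new input vector.
x_new = np.array([3.0, 12.0, 2.0, 0.0, 2.0])
y_new = float(x_new @ weights + intercept)
```

Because the synthetic targets are noiseless, the fitted coefficients recover the generating ones up to floating-point error.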


Support Vector Regression (SVR)

The basic idea is to find a function, also known as the hyperplane, which is as flat as possible and which deviates from the observed response values by no more than the margin of tolerance for each training point [12, 29].


Flatness of the hyperplane is achieved by minimising the value of . This optimization problem can be represented with the formulation stated in [32].


We use the Linear Support Vector Regression model (LinearSVR), which uses a linear kernel, rather than the Epsilon-Support Vector Regression model (SVR) in the scikit-learn library for our numerical experiments, since it is more flexible in the choice of penalties and loss functions and scales better to large samples. We choose the epsilon-insensitive loss function; ε and the penalty parameter C are set to the default values of 0.0 and 1.0 respectively. Note that C determines the trade-off between the amount up to which deviations larger than ε are tolerated and the flatness of the hyperplane. The maximum number of iterations was set to 1000 with a tolerance of 1e-05.
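To make the roles of the tolerance and the penalty parameter concrete, the following sketch evaluates the epsilon-insensitive loss and a linear SVR objective of the usual form (half the squared weight norm plus the penalty parameter times the summed losses). The function names and the small data set are hypothetical, and the objective is written in its textbook form rather than copied from any library's internals.

```python
def epsilon_insensitive(residual, epsilon=0.0):
    """Deviations up to epsilon cost nothing; larger ones cost linearly."""
    return max(0.0, abs(residual) - epsilon)


def svr_objective(w, b, X, y, C=1.0, epsilon=0.0):
    """Flatness term plus C times the total epsilon-insensitive loss."""
    flatness = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, target in zip(X, y):
        prediction = sum(wi * xi for wi, xi in zip(w, x)) + b
        penalty += epsilon_insensitive(target - prediction, epsilon)
    return flatness + C * penalty


# Tiny hypothetical example: two points, two features.
X = [[1.0, 1.0], [0.0, 1.0]]
y = [3.0, 5.0]
w, b = [1.0, 2.0], 0.0

obj_default = svr_objective(w, b, X, y, C=1.0, epsilon=0.0)
obj_wide = svr_objective(w, b, X, y, C=1.0, epsilon=1.0)
```

With a tolerance of 0.0 every deviation is penalized, so the loss reduces to the absolute error; widening the tolerance lowers the penalty term without changing the flatness term. In scikit-learn, the settings described above would correspond to `LinearSVR(epsilon=0.0, C=1.0, loss="epsilon_insensitive", max_iter=1000, tol=1e-5)`.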

Neural Network (NN)

Neural networks can be classified as black box models which are useful in capturing many kinds of dynamic and non-linear relationships and patterns in both structured and unstructured datasets. We used the Keras deep learning library to implement our deep neural network. We employed a feed forward neural network with 5 nodes in the input layer, 10 nodes in the first hidden layer, 10 nodes in the second hidden layer and a single node in the output layer, which is used for our predictions. These layers are dense layers, since all nodes in the previous layer are connected to the nodes in the current layer. We also included two dropout layers with a dropout rate of 0.2, which randomly select neurons to be ignored during training.

The rectified linear unit activation function is used within the first two layers whereas the linear activation function is employed in the output layer for predictions. We used the mean squared logarithmic error as our loss function. The loss function is minimized by backpropagating the current error to the previous layer where it is used to modify the weights and bias through the Adam optimization algorithm. We chose Adam, which is shown to work well in practice and compares favorably to other stochastic optimization methods. We used the default settings for the Adam optimizer suggested in the Keras documentation for training our model.
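The architecture described above can be sketched as a plain NumPy forward pass, which makes the layer sizes, activations, dropout and loss explicit. This is an illustrative re-implementation under stated assumptions, not the Keras model used in the experiments: the weight initialization is arbitrary, dropout is assumed to follow each hidden layer in its inverted form, and the loss helper assumes non-negative targets and predictions.

```python
import numpy as np

LAYER_SIZES = (5, 10, 10, 1)   # input, two hidden layers, output
DROPOUT_RATE = 0.2


def init_params(rng, sizes=LAYER_SIZES):
    """Small random weights and zero biases for each dense layer."""
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, x, rng=None):
    """ReLU on hidden layers, linear output; dropout only if rng is given."""
    h = x
    last = len(params) - 1
    for i, (W, b) in enumerate(params):
        z = h @ W + b
        if i == last:
            return z                          # linear activation on the output
        h = np.maximum(z, 0.0)                # rectified linear unit
        if rng is not None:                   # training mode: inverted dropout
            keep = rng.random(h.shape) >= DROPOUT_RATE
            h = h * keep / (1.0 - DROPOUT_RATE)


def msle(y_true, y_pred):
    """Mean squared logarithmic error for non-negative inputs."""
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))


rng = np.random.default_rng(0)
params = init_params(rng)
n_params = sum(W.size + b.size for W, b in params)

x = rng.random((4, 5))                        # four hypothetical input rows
y_hat = forward(params, x)                    # inference mode, no dropout
```

Counting weights and biases layer by layer gives 60 + 110 + 11 trainable parameters, which is small compared with the 300000 training inputs mentioned earlier.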

We use a batch size of 100 and 15 epochs during the training process of our network. Our data is initially split with 70% as the training samples and 30% as the test samples. The network is then trained using the training samples and its parameters are iteratively adjusted until the loss function is minimised. The validation split is then used to randomly select 20% of the test samples for prediction using the trained neural network, and the validation loss is computed after each epoch. This process is continued until the validation loss no longer decreases, and the optimized weights of the model with the lowest validation loss are selected.
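The split arithmetic and the selection of the best weights can be illustrated with a short sketch. The sample count of 300000 and the 70/30/20 percentages come from the text; the helper names, the patience value and the sequence of validation losses are hypothetical and serve only to show the selection logic.

```python
def split_sizes(n_samples, train_pct=70, val_pct_of_test=20):
    """Sizes of the training set, test set and validation subset."""
    n_train = n_samples * train_pct // 100
    n_test = n_samples - n_train
    n_val = n_test * val_pct_of_test // 100
    return n_train, n_test, n_val


def train_with_early_stopping(run_epoch, max_epochs=15, patience=1):
    """Keep the weights with the lowest validation loss seen so far and
    stop once the loss has failed to improve `patience` times in a row."""
    best_loss, best_weights, best_epoch = float("inf"), None, None
    stale = 0
    for epoch in range(1, max_epochs + 1):
        weights, val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_weights, best_epoch = val_loss, weights, epoch
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_loss, best_weights


n_train, n_test, n_val = split_sizes(300000)
batches_per_epoch = -(-n_train // 100)        # ceiling division, batch size 100

# Hypothetical validation losses standing in for real training epochs.
toy_losses = [2.0, 1.7, 1.5, 1.6, 1.4]

def toy_epoch(epoch):
    return f"weights@{epoch}", toy_losses[epoch - 1]

best_epoch, best_loss, best_weights = train_with_early_stopping(toy_epoch)
```

With a patience of one, training stops at the first epoch whose validation loss fails to improve, so a later, lower loss (the 1.4 in the toy sequence) is never reached; a larger patience trades extra training time for that possibility.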


7.3 Evaluation benchmarks

We have thirteen problem classes differing mainly in the distributional settings for generating the exogenous information. We summarize these in Table 1.
We generated 2000 instances for each class to evaluate the performance of the policies from our algorithm. All instances for each class are generated as explained in subsection 7.1, in the same way that sampling is done when training the policy iteration algorithm. The deterministic optimal policies for each of these instances have been computed using the IP given in section 5.

Data Class   Price Process
S1           MC + jump
S2           MC + jump
S3           MC + jump
S4           MC + jump
S5           MC + jump
S6           MC + jump
S7           MC + jump
S8           MC + jump
S9           MC + jump
S10          MC + jump
S11          MC + jump
S12          MC
S13          MC
Table 1: Data Classes

7.4 Metrics

To evaluate the performance of our policy against the deterministic optimum on the benchmark instances we use the % optimal metric defined as follows:
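A minimal sketch of such a metric, assuming it is the profit obtained by the policy expressed as a percentage of the deterministic optimal profit on the same instance, and assuming the optimal profit is positive. The helper names and profit values are hypothetical; the second helper computes the share of instances above a threshold, as reported for the 80% level in the numerical findings.

```python
def percent_optimal(policy_profit, optimal_profit):
    """Policy profit as a percentage of the deterministic optimum."""
    return 100.0 * policy_profit / optimal_profit


def share_above(values, threshold=80.0):
    """Fraction of instances whose % optimal strictly exceeds the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Hypothetical (policy profit, optimal profit) pairs for four instances.
profits = [(45.0, 50.0), (37.5, 50.0), (40.0, 50.0), (42.5, 50.0)]
scores = [percent_optimal(p, opt) for p, opt in profits]
average = sum(scores) / len(scores)
share = share_above(scores)
```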


7.5 Initial policy

Policy iteration starts with an initial policy which is improved until it converges to the optimal policy. Our initial policy, which we refer to as naive policy, is outlined in Algorithm 3. The reason for this choice of naive policy is to see if our approach can learn starting from such a simple and very naive policy.

if  then
    ; ;
   ; ;
end if
Algorithm 3 Naive policy

7.6 Numerical findings

We are now ready to discuss our numerical findings. First we discuss the computation times. We give running times for the high case in Table 2 and for the low case in Table 3.

9 8.8 10.5
Table 2: Average computation times across all data classes in hours (high scenario)
8.1 8.4 7.7
Table 3: Average computation times across all data classes in hours (low scenario)

We ran the code on Google Colab's Tesla K80 GPU, which has a virtual RAM of about 13GB and disk space of about 350GB. Computation times are dominated by neural network prediction times in the improvement stage. This is part of the reason for only using a sample of states at each improvement stage. However, the improvement step can be sped up by parallelization, using the fact that the underlying optimization problem is one-dimensional. An efficient parallelization strategy may even make exact policy improvement possible instead of simulation. Training the neural network model can also be time intensive, especially when scaling up to hundreds of time periods. The mean squared logarithmic error is approximately 1.5 in all rounds. Finally, note that since we do not explore the full state space, when applying our policy we fall back to the naive policy whenever we encounter a state that was not used in the improvement stage.
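One possible shape for a parallel improvement step, together with the fallback to the naive policy for unseen states, is sketched below. This is an assumption about how such a strategy could look, not the implementation used here: the function names and the toy scoring rule are hypothetical, and a thread pool is used only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def best_decision(state, decisions, score):
    """One-dimensional search over the candidate decisions for one state."""
    return max(decisions, key=lambda d: score(state, d))


def parallel_improvement(states, decisions, score, workers=4):
    """States are independent, so each search can run in its own worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        best = list(pool.map(lambda s: best_decision(s, decisions, score), states))
    return dict(zip(states, best))


def act(policy, state, naive_policy):
    """Use the improved policy where available, else the naive policy."""
    return policy[state] if state in policy else naive_policy(state)


# Toy scoring rule (assumption): the best decision for a state is state % 4.
def toy_score(state, d):
    return -(d - state % 4) ** 2

states = list(range(11))          # e.g. initial battery levels 0 to 10
decisions = range(0, 4)
policy = parallel_improvement(states, decisions, toy_score)

seen = act(policy, 6, naive_policy=lambda s: 0)
unseen = act(policy, 99, naive_policy=lambda s: 0)
```

Threads give little speed-up for CPU-bound pure Python scoring because of the interpreter lock; with a neural network predictor, batching all candidate decisions into a single prediction call or using separate processes would likely be the more effective variants.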

Our numerical experiments are summarized in Figures 4 and 5. A number of observations can be made from Figures 4 and 5:

  • API with neural networks (NNs) has consistent performance across all classes, coming close to the better of the other two approaches, LR and SVR.

  • NNs outperform both LR and SVR in the low efficiency scenario, by a good margin in some classes. This makes a case in support of NNs, as reaching close to optimality becomes even more important in the low efficiency scenario. The better performance of NNs in the low scenario compared to the high scenario may be attributed to the non-linearity of which becomes more pronounced with low efficiency. Understandably, NNs do not really outperform the other approaches in the high efficiency case, since, with high efficiency is almost linear.

  • Surprisingly, % optimality in the high efficiency case is (much) lower than in the low case for all three policies. This seems to indicate that the high case is harder than the low efficiency case. This counter-intuitive behaviour is due to a decrease in the optimal profits.

  • Between LR and SVR, SVR seems to perform well in both scenarios. This is in line with the observation in [12] on the single price problem. However, SVR also has high variance between classes, with very good performance on some and very poor performance on others.

  • The variance in performance across classes, which we illustrate by showing the proportion of instances with % optimality greater than 80% in each case, is lowest for NNs, as can be seen from Figure 5. Between the low and high scenarios, NNs performed most consistently, with less variance, in the low scenario.

  • Our experiments seem to suggest that NNs are more able to deal with discrete decisions compared to the other two approaches.

  • We observed that the performance of the NN policy at initial iterations is worse than that of LR and SVR, but improves at every iteration of policy improvement. However, the improvement in performance after 10 iterations is marginal compared to the time required. LR and SVR behave very similarly to each other in terms of policy improvement but very differently from NN. For example, LR and SVR policies improve on average by less than 10% going from iteration 1 to iteration 10 for classes S4-S13. We suspect the reason for this to be the pseudonormal distribution of


  • Finally, all three policies outperform the naive policy by a large margin; the naive policy achieves less than 55% optimality in the high case and less than 60% in the low case.

Figure 4: Comparison of Average Optimality between NN, LR and SVR. Panels: (a) High scenario; (b) Low scenario.
Figure 5: Proportion of the 2000 instances with %Optimality greater than 80%. Panels: (a) High scenario; (b) Low scenario.

8 Conclusion

We revisit approximate policy iteration methods for solving stochastic SNES by adopting a simpler aggregate model instead of the commonly used flow model. We show that modeling with aggregate variables allows us to use more advanced approximation architectures in the policy evaluation stage, drawn from the ever-growing machine learning toolbox, including neural networks. We make a case for neural networks by illustrating that approximate policy iteration with neural networks outperforms API with support vector regression, which was shown to perform best among several architectures tried in the literature.


Acknowledgements

We thank Denes Csala for initial discussions on this work. Part of Richlove Frimpong’s work was done while she was at the Centre for Global Eco-Innovation, Lancaster University, and she acknowledges the centre’s support during this time.


References

  • [1] D.P. Bertsekas and J.N. Tsitsiklis (1996) Neuro-dynamic programming. Athena Scientific, Belmont, MA.
  • [2] D.P. Bertsekas (2011) Approximate policy iteration: a survey and some new methods. Journal of Control Theory and Applications, pp. 310–315.
  • [3] D.P. Bertsekas (2018) Feature-based aggregation and deep reinforcement learning: a survey and some new implementations. arXiv:1804.04577.
  • [4] J.M. Eyer and G.P. Corey (2010) Energy storage for the electricity grid: benefits and market potential assessment guide. Technical Report SAND2010-0815, Sandia National Laboratories, pp. 69–73.
  • [5] J.M. Eyer, J.J. Iannucci, and G.P. Corey (2004) Energy storage benefits and market analysis handbook, a study for the DOE energy storage systems program. Technical Report SAND2004-6177, Sandia National Laboratories.
  • [6] N. Gautam, Y. Xu, and J.T. Bradley (2014) Meeting inelastic demand in systems with storage and renewable sources. In 2014 IEEE International Conference on Smart Grid Communications (SmartGridComm), pp. 97–102.
  • [7] N. Halman, D. Klabjan, M. Mostagir, J. Orlin, and D. Simchi-Levi (2009) A fully polynomial time approximation scheme for single item inventory control with discrete demand. Mathematics of Operations Research 34 (3), pp. 674–685.
  • [8] N. Halman, G. Nannicini, and J. Orlin (2018) On the complexity of energy storage problems. Discrete Optimization 28, pp. 31–53.
  • [9] J. Han and E. Weinan (2016) Deep learning approximation for stochastic control problems. arXiv:1611.07422.
  • [10] L. Hannah and D. Dunson (2012) Approximate dynamic programming for storage problems. Proceedings of the 29th International Conference on Machine Learning.
  • [11] P. Harsha and M. Dahleh (2015) Optimal management and sizing of energy storage under dynamic pricing for the efficient integration of renewable energy. IEEE Transactions on Power Systems 30 (3), pp. 1164–1181.
  • [12] D.R. Jiang, T.V. Pham, W.B. Powell, D.F. Salas, and W.R. Scott (2014) A comparison of approximate dynamic programming techniques on benchmark energy storage problems: does anything work? In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 1–8.
  • [13] D.R. Jiang and W.B. Powell (2015) Optimal hour-ahead bidding in the real-time electricity market with battery storage using approximate dynamic programming. INFORMS Journal on Computing 27 (3), pp. 525–543.
  • [14] D. Liu and Q. Wei (2014) Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems 25 (3), pp. 621–634.
  • [15] N. Löhndorf and S. Minner (2010) Optimal day-ahead trading and storage of renewable energies—an approximate dynamic programming approach. Energy Systems 1 (1), pp. 61–77.
  • [16] S. Moazeni, W.B. Powell, and A.H. Hajimiragha (2015) Mean-conditional value-at-risk optimal energy storage operation in the presence of transaction costs. IEEE Transactions on Power Systems 30 (3), pp. 1222–1232.
  • [17] J.M. Nascimento and W.B. Powell (2010) Dynamic programming models and algorithms for the mutual fund cash balance problem. Management Science 56 (5), pp. 801–815.
  • [18] J.M. Nascimento and W.B. Powell (2013) An optimal approximate dynamic programming algorithm for concave, scalar storage problems with vector-valued controls. IEEE Transactions on Automatic Control 58 (12), pp. 2995–3010.
  • [19] J.M. Nascimento (2008) Approximate dynamic programming for complex storage problems.
  • [20] G. Natarajan, Y. Xu, and J. Bradley (2014) Meeting inelastic demand in systems with storage and renewable sources. In Proceedings of the IEEE Fifth International Conference on Smart Grid Communications (SmartGridComm), pp. 97–102.
  • [21] Ofgem (Website). [Online; accessed 05-December-2018].
  • [22] Y. Pochet and L.A. Wolsey (2006) Production planning by mixed integer programming. Springer, New York.
  • [23] E.L. Porteus (2002) Foundations of stochastic inventory theory. Stanford Business Books, Palo Alto, CA.
  • [24] W.B. Powell, A. George, H. Simão, W. Scott, A. Lamont, and J. Stewart (2012) SMART: a stochastic multiscale model for the analysis of energy resources, technology, and policy. INFORMS Journal on Computing 24 (4), pp. 665–682.
  • [25] R. Rempala (1994) Optimal strategy in a trading problem with stochastic prices. In J. Henry, J.-P. Yvon (Eds.), System Modelling and Optimization, Lecture Notes in Control and Information Sciences 197, Springer Berlin Heidelberg, pp. 560–566.
  • [26] D. Salas and W.B. Powell (2013) Benchmarking a scalable approximation dynamic programming algorithm for stochastic control of multidimensional energy storage problems. Technical report, Princeton University.
  • [27] W.R. Scott and W.B. Powell (2012) Approximate dynamic programming for energy storage with new results on instrumental variables and projected Bellman errors. Technical report, Princeton University.
  • [28] N. Secomandi (2010) Optimal commodity trading with a capacitated storage asset. Management Science 56 (3), pp. 449–467.
  • [29] A.J. Smola and B. Schölkopf (2004) A tutorial on support vector regression. Statistics and Computing 14 (3), pp. 199–222.
  • [30] S. Teleke, M.E. Baran, S. Bhattacharya, and A.Q. Huang (2010) Rule-based control of battery energy storage for dispatching intermittent renewable sources. IEEE Transactions on Sustainable Energy 1 (3), pp. 117–124.
  • [31] G.J. Tesauro (2002) Programming backgammon using self-teaching neural nets. Artificial Intelligence 134, pp. 181–199.
  • [32] V. Vapnik (1995) The nature of statistical learning theory. Springer, New York.
  • [33] X. Xiaomin, R. Sioshansi, and V. Marano (2014) A stochastic dynamic programming model for co-optimization of distributed energy storage. Energy Systems 5 (3), pp. 475–505.
  • [34] Y. Yuan, R. Lorenzo, and C. Andrea (2007) On early stopping in gradient descent learning. Constructive Approximation 26, pp. 289–315.
  • [35] Y. Zhou, A. Scheller-Wolf, N. Secomandi, and S. Smith (2016) Electricity trading and negative prices: storage vs. disposal. Management Science 62 (3), pp. 880–898.
  • [36] Y. Zhou, A. Scheller-Wolf, N. Secomandi, and S. Smith (2018) Managing wind-based electricity generation in the presence of storage and transmission capacity. Technical Report 2011-E36, Tepper School of Business, Carnegie Mellon University. Available at SSRN.
  • [37] P.H. Zipkin (2000) Foundations of inventory management. McGraw-Hill, New York, NY.