Sequential decision problems, in the form of Markov decision processes (MDPs), are most often formulated with the objective of minimizing an expected sum of costs or maximizing an expected sum of rewards (Puterman, 2014; Bertsekas and Tsitsiklis, 1996; Powell, 2011). However, it is becoming increasingly evident that solely considering the expectation is insufficient, as risk preferences can vary greatly from application to application. Broadly speaking, the expected value can fail to be useful in settings containing either heavy-tailed distributions or rare but high-impact events. For example, heavy-tailed distributions arise frequently in finance (electricity prices are well known to possess this feature; see Byström (2005), Kim and Powell (2011)). In this case, the mean of the distribution itself may not necessarily be a good representation of the randomness of the problem; instead, it is likely useful to introduce a measure of risk on the tail of the distribution as well. The rare event situation is, in a sense, the inverse case of the heavy-tail phenomenon, but it can also benefit from a risk measure other than the expectation. To illustrate, certain problems in operations research can be complicated by critical events that happen with small probability, such as guarding against stock-outs and large back-orders in inventory problems (see Glasserman and Liu (1996)) or managing the risk of the failure of a high-value asset (see Enders et al. (2010)). In these circumstances, the merit of a policy might be measured by the number of times that a bad event happens over some time horizon.
One way to introduce risk-aversion into sequential problems is to formulate the objective using dynamic risk measures (Ruszczyński, 2010). A rough preview, without formal definitions, of our optimization problem is as follows: we wish to find a policy that minimizes risk, as assessed by a certain type of dynamic risk measure. The objective can be written as
$$\min_{\pi \in \Pi} \; \rho_0\Bigl(C_1^\pi + \rho_1\bigl(C_2^\pi + \cdots + \rho_{T-1}(C_T^\pi) \cdots\bigr)\Bigr),$$
where $\Pi$ is a set of policies, $\{C_t^\pi\}$ are costs under policy $\pi$, and $\{\rho_t\}$ are one-step risk measures (i.e., components of the overall dynamic risk measure). Precise definitions are given in the subsequent sections. We focus on the case where the objective at each stage is to optimize a quantile-based risk measure (QBRM) of future costs; we call the overall objective a dynamic quantile-based risk measure (DQBRM).
This paper makes the following contributions. First, we describe a new data-driven or simulation-based ADP algorithm, called Dynamic-QBRM ADP, that is similar in spirit to established asynchronous algorithms like $Q$-learning (see Watkins and Dayan (1992)) and lookup table approximate value iteration (see, e.g., Bertsekas and Tsitsiklis (1996), Powell (2011)), where one state is updated per iteration. The second contribution of the paper is a companion sampling procedure to Dynamic-QBRM ADP, which we call risk-directed sampling (RDS). As described above, when dealing with risk, there is a large class of problems in which we are inherently dealing with rare but very costly events. Broadly speaking, the evaluation of a QBRM that is focused on the tail of the distribution (e.g., CVaR at, say, the 99% level) depends crucially on efficiently directing the algorithm toward sampling these “risky” regions. In this part of the paper, we consider the question: is there a way to learn, as the ADP algorithm progresses, the interesting values of the information process to sample?
The paper is organized as follows. We first provide a literature review in Section 2. In Section 3, we give our problem formulation, a brief introduction to dynamic risk measures, and the definition of a class of quantile-based risk measures. Next, we introduce the algorithm for solving risk-averse MDPs in Section 4 and give a theoretical analysis in Section 5. In Section 6, we discuss sampling issues and describe the companion sampling procedure. We show numerical results on an example energy trading application in Section 7 and conclude in Section 8.
2 Literature Review
The theory of dynamic risk measures and the notion of time-consistency (see e.g. Riedel (2004), Artzner et al. (2006), Cheridito et al. (2006)) is extended to the setting of sequential optimization problems in Ruszczyński and Shapiro (2006a) and Ruszczyński (2010), in which it is proved that any time-consistent dynamic risk measure can be written as compositions of one-step conditional risk measures (these are simply risk measures defined in a conditional setting, analogous to the conditional expectation for the traditional case). From this, a Bellman recursion is obtained, becoming a familiar way of characterizing optimal policies. Building on the theory of dynamic programming, versions of exact value iteration and policy iteration are also developed in Ruszczyński (2010). Later, in Çavus and Ruszczyński (2014), these exact methods are analyzed in the more specific case of undiscounted transient models.
Under the assumption that we use one-step coherent risk measures, as axiomatized in Artzner et al. (1999), the value functions of a risk-averse Markov decision process with a convex terminal value function can be easily shown to satisfy convexity using the fact that coherent risk measures are convex and monotone. Therefore, the traditional method of stochastic dual dynamic programming (SDDP) of Pereira and Pinto (1991) for multistage, risk-neutral problems, which relies on the convexity of value functions, can be adapted to the risk-averse case. This idea is successfully explored in Philpott and de Matos (2012), Shapiro et al. (2013), and Philpott et al. (2013), with applications to the large-scale problem of hydro-thermal scheduling using one-step mean-CVaR (convex combination of mean and CVaR) and one-step mean-upper semideviation risk measures. The main drawbacks of risk-averse SDDP are (1) the cost function must be linear in the state, (2) some popular risk measures, such as value at risk (VaR), are excluded because they are not coherent and do not imply convex value functions, and (3) the full risk measure (which can be recast as an expectation in certain instances) has to be computed at every iteration. Since no convexity or linearity assumptions are made in this paper, we take an alternative approach to the SDDP methods and instead assume the setting of finite state and action spaces, as in $Q$-learning. At the same time, because the default implementation of our approach does not take advantage of structure, it is limited to smaller problems. Extensions to the methods proposed in this paper for exploiting structure can be made by following techniques such as those discussed in Powell et al. (2004), Nascimento and Powell (2009), and Jiang and Powell (2015a).
Recursive stochastic approximation methods have been applied to estimating quantiles in static settings (see Tierney (1983), Bardou and Frikha (2009), and Kan (2011)). Related to our work, a policy gradient method for optimizing MDPs (with a risk-neutral objective) under a CVaR constraint is given in Chow and Ghavamzadeh (2014). All of these methods are related to ours in the sense that the minimization formula (Rockafellar and Uryasev, 2002, Theorem 10) for CVaR is optimized with gradient techniques. In our multistage setting with dynamic risk measures, which is also coupled with optimal control, there are some new interesting complexities, including the fact that every new observation (or data point) is generated from an imperfect distribution of future costs that is “bootstrapped” from the previous estimate of the value function. This means that not only are the observations inherently biased, but the errors also compound over time; this was not the case for the static setting considered in earlier work. Under reasonable assumptions, we analyze both the almost sure convergence and convergence rates for our proposed algorithms.
Our risk-directed sampling procedure is inspired by adaptive importance sampling strategies from the literature, such as the celebrated cross-entropy method of Rubinstein (1999). See, e.g., Al-Qaq et al. (1995), Bardou and Frikha (2009), Egloff and Leippold (2010), and Ryu and Boyd (2015) for other similar approaches. The critical difference in our approach is that in an ADP setting, we have the added difficulty of not being able to assume perfect knowledge of the objective function; rather, our observations are noisy and biased. To our knowledge, this is the first time an adaptive sampling procedure has been combined with a value function approximation algorithm in the risk-averse MDP literature. The closest paper is by Kozmík and Morton (2014), which considers an importance sampling approach for policy evaluation.
3 Problem Formulation
In this section, we establish the setting of the paper. In particular, we describe the risk-averse model, introduce the concept of dynamic risk measures, and define a class of quantile-based risk measures.
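We consider an MDP with a finite time horizon, $0, 1, \ldots, T$, where the last decision is made at time $T-1$, so that the set of decision epochs is given by $\mathcal{T} = \{0, 1, \ldots, T-1\}$. Given a probability space $(\Omega, \mathcal{F}, \mathbf{P})$, we define a discrete-time stochastic process $\{W_t\}$, with $W_t \in \mathcal{W}$ for all $t$, as the exogenous information process in the sequential decision problem, where $W_t$ is adapted to a filtration $\{\mathcal{F}_t\}$. We assume that all sources of randomness in the problem are encapsulated by the process $\{W_t\}$ and that it is independent across time. For computational tractability, we work in the setting of finite state and action spaces. Let the state space be denoted $\mathcal{S}$, and let the action space be $\mathcal{A}$, where $|\mathcal{S}| < \infty$ and $|\mathcal{A}| < \infty$. The set of feasible actions for each state $s \in \mathcal{S}$, written $\mathcal{A}_s$, is a subset of $\mathcal{A}$. The set $\mathcal{U} = \{(s,a) \in \mathcal{S} \times \mathcal{A} : a \in \mathcal{A}_s\}$ is the set of all feasible state-action pairs. The stochastic process describing the states of the system is $\{S_t\}$, where $S_t$ is an $\mathcal{F}_t$-measurable random variable taking values in $\mathcal{S}$, and $a_t \in \mathcal{A}_{S_t}$ is a feasible action determined by the decision maker using $S_t$. Furthermore, let $\mathcal{Z}_t$ denote the space of $\mathcal{F}_t$-measurable random variables and $\mathcal{Z}_{t,T} = \mathcal{Z}_t \times \cdots \times \mathcal{Z}_T$.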
We model the system using a transition function or system model $S^M$, which produces the next state $S_{t+1}$ given a current state $S_t$, action $a_t$, and an outcome of the exogenous process $W_{t+1}$: $S_{t+1} = S^M(S_t, a_t, W_{t+1})$. The cost for time $t$ is given by $c_t(S_t, a_t, W_{t+1})$, where $c_t$ is the cost function. A policy is a sequence of decision functions $\{A_t^\pi\}$ indexed by $\pi \in \Pi$, where $\Pi$ is the index set of all policies. Each decision function $A_t^\pi$ is a mapping from a state to a feasible action, such that $A_t^\pi(s) \in \mathcal{A}_s$ for any state $s$. Let the sequence of costs under a policy $\pi$ be represented by the process $\{C_t^\pi\}$ for $\pi \in \Pi$, where
$$C_t^\pi = c_{t-1}\bigl(S_{t-1}^\pi, A_{t-1}^\pi(S_{t-1}^\pi), W_t\bigr),$$
and $S_t^\pi$ are the states visited while following policy $\pi$. Note that $C_t^\pi$ refers to the cost from time $t-1$, but the index $t$ refers to its measurability: $C_t^\pi$ depends on information only known at time $t$.
3.2 Review of Dynamic Risk Measures
In this subsection, we briefly introduce the notion of a dynamic risk measure; for a more detailed treatment, see, e.g., Frittelli and Gianin (2004), Riedel (2004), Pflug and Ruszczyński (2005), Boda and Filar (2006), Cheridito et al. (2006), and Acciaio and Penner (2011). Our presentation closely follows that of Ruszczyński (2010), which develops the theory of dynamic risk measures in the context of MDPs. First, a conditional risk measure is a mapping that satisfies the following monotonicity requirement: for and (componentwise and almost surely), .
Given a sequence of future costs , the intuitive meaning of is a certainty equivalent cost (i.e., at time , one is indifferent between incurring and the alternative of being subjected to the stream of stochastic future costs). See Rudloff et al. (2014) for an in-depth discussion regarding the certainty equivalent interpretation in the context of multistage stochastic models. A dynamic risk measure is a sequence of conditional risk measures , which allows us to evaluate the future risk at any time using . Of paramount importance to the theory of dynamic risk measures is the notion of time-consistency, which says that if from the perspective of some future time , one sequence of costs is riskier than another and the two sequences of costs are identical from the present until , then the first sequence is also riskier from the present perspective (see Ruszczyński (2010) for the full technical definition).
Other definitions of time-consistency can be found in the literature, e.g., Boda and Filar (2006), Cheridito and Stadje (2009), and Shapiro (2009). Though they may differ technically, these definitions share the same intuitive spirit. Under the conditions:
it is proven in Ruszczyński (2010) that for some one-step conditional risk measures , a time-consistent, dynamic risk measure can be expressed using the following nested representation:
It is thus clear that we can take the reverse approach and define a time-consistent dynamic risk measure by simply specifying a set of one-step conditional risk measures . This is a common method that has been used in the literature when applying the theory of dynamic risk measures in practice (see, e.g., Philpott and de Matos (2012), Philpott et al. (2013), Shapiro et al. (2013), Kozmík and Morton (2014), and Rudloff et al. (2014)).
3.3 Quantile-Based Risk Measures
In this paper, we focus on simulation techniques where the one-step conditional risk measure belongs to a specific class of risk measures called quantile-based risk measures (QBRM). Although the term quantile-based risk measure has been used in the literature to refer to risk measures that are similar in spirit to VaR and CVaR (see, e.g., Dowd and Blake (2006), Neise (2008), Sereda et al. (2010)), it has not been formally defined. First, let us describe these two popular risk measures, which serve to motivate a more general definition for a QBRM.
Also known as the quantile risk measure, VaR is a staple of the financial industry (see, e.g., Duffie and Pan (1997)). Given a real-valued random variable $X$ (representing a loss) and a risk level $\alpha \in (0,1)$, the VaR or quantile of $X$ is defined to be
$$\operatorname{VaR}_\alpha(X) = q_\alpha(X) = \inf\{u : \mathbf{P}(X \le u) \ge \alpha\}.$$
To simplify our notation, we use $q_\alpha(X)$ in the remainder of this paper. It is well known that VaR does not satisfy coherency (Artzner et al., 1999), specifically the axiom of subadditivity, an appealing property that encourages diversification. Despite this, several authors have given arguments in favor of VaR. For example, Danielsson et al. (2005) concludes that in practical situations, VaR typically exhibits subadditivity. Dhaene et al. (2006) and Ibragimov and Walden (2007) give other points of view on why VaR should not be immediately dismissed as an effective measure of risk. A nested version of VaR for use in a multistage setting is proposed in Cheridito and Stadje (2009), though practical implications have not been explored in the literature.
CVaR is a coherent alternative to VaR and has been both studied and applied extensively in the literature. Although the precise definitions may slightly differ, CVaR is also known by names such as expected shortfall, average value at risk, or tail conditional expectation. Given a general random variable $X$, the following characterization is given in Rockafellar and Uryasev (2002):
$$\operatorname{CVaR}_\alpha(X) = q_\alpha(X) + \frac{1}{1-\alpha}\,\mathbf{E}\bigl[(X - q_\alpha(X))^+\bigr].$$
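As a rough illustration of the two risk measures just described, the following minimal sketch computes the empirical quantile (VaR) and the quantile-plus-mean-excess form of CVaR on a small sample. The function names and the toy data are assumptions made for this example only and are not part of the original development.

```python
import math

def empirical_var(samples, alpha):
    """Smallest sample value u with empirical P(X <= u) >= alpha."""
    xs = sorted(samples)
    # Small tolerance guards against floating-point noise in alpha * n.
    k = max(math.ceil(alpha * len(xs) - 1e-12), 1)
    return xs[k - 1]

def empirical_cvar(samples, alpha):
    """CVaR as q + E[(X - q)^+] / (1 - alpha), with q the alpha-quantile."""
    q = empirical_var(samples, alpha)
    mean_excess = sum(max(x - q, 0.0) for x in samples) / len(samples)
    return q + mean_excess / (1.0 - alpha)

# Toy loss sample (illustrative only): one large loss in the tail.
losses = [1.0, 2.0, 3.0, 4.0, 10.0]
var_80 = empirical_var(losses, 0.8)
cvar_80 = empirical_cvar(losses, 0.8)
```

On this sample the 80% quantile ignores the size of the single large loss, whereas CVaR at the same level is driven entirely by it, which is the contrast between the two measures that the text emphasizes.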
Applications of risk-averse MDPs using dynamic risk measures have largely focused on combining CVaR with expectation; once again, see Philpott and de Matos (2012), Philpott et al. (2013), Shapiro et al. (2013), Kozmík and Morton (2014), and Rudloff et al. (2014).
For the purposes of this paper, we offer the following general definition of a QBRM that allows dependence on more than one quantile; the definition includes the above two examples as special cases.
Definition 1 (Quantile-Based Risk Measure (QBRM)).
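Let $X$ be a real-valued random variable. A quantile-based risk measure $\rho$ can be written as the expectation of a function of $X$ and finitely many of its quantiles. More precisely, $\rho$ takes the form
$$\rho(X) = \mathbf{E}\bigl[\Phi\bigl(X, q_{\alpha_1}(X), q_{\alpha_2}(X), \ldots, q_{\alpha_m}(X)\bigr)\bigr], \tag{3.1}$$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ is a vector of $m$ risk levels, $\alpha_1, \alpha_2, \ldots, \alpha_m \in (0,1)$, and $\Phi : \mathbb{R}^{m+1} \to \mathbb{R}$ is a function chosen so that $\rho$ satisfies monotonicity, translation invariance, and positive homogeneity (see Artzner et al. (1999) for the precise definitions and note that we interpret $X$ as a random loss or a cost).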
Our definition of QBRMs is largely motivated by practical considerations. First, the definition covers the two most widely used risk measures, VaR and CVaR, as special cases under a single framework; in addition, the flexibility allows for the specification of more sophisticated risk measures that may or may not be coherent. As previously mentioned, there are situations where nonconvex (and thus, not coherent) risk measures are appropriate (Dhaene et al., 2006). Another motivation for this definition of a QBRM is that it allows us to easily construct a risk measure $\rho$ such that $\operatorname{VaR}_\alpha(X) \le \rho(X) \le \operatorname{CVaR}_\alpha(X)$, because, as Belles-Sampera et al. (2014) points out, one issue with VaR is that it can underestimate large losses, but at the same time, some practitioners of the financial and insurance industries find CVaR to be too conservative.
We see that VaR is trivially a QBRM with $\Phi(X, q) = q$. CVaR can also be easily written as a QBRM, using the function $\Phi(X, q) = q + \frac{1}{1-\alpha}(X - q)^+$. Although our approach can be applied to any risk measure of the form (3.1), we use the CVaR risk measure in the empirical work of Section 7, due to its popularity in a variety of application areas.
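To make the "expectation of a function of the loss and its quantiles" structure concrete, here is a minimal sample-based sketch. It assumes the standard integrands $\Phi(x, q) = q$ for VaR and $\Phi(x, q) = q + (x - q)^+/(1-\alpha)$ for CVaR; the blended integrand `make_phi_blend` is a hypothetical construction added only to illustrate a risk measure lying between VaR and CVaR, and all function names are assumptions of this example.

```python
import math

def quantile(samples, alpha):
    """Empirical alpha-quantile: smallest u with P(X <= u) >= alpha."""
    xs = sorted(samples)
    k = max(math.ceil(alpha * len(xs) - 1e-12), 1)
    return xs[k - 1]

def qbrm(samples, alphas, phi):
    """Sample average of phi(x, q_1, ..., q_m) over the observed losses."""
    qs = [quantile(samples, a) for a in alphas]
    return sum(phi(x, *qs) for x in samples) / len(samples)

def phi_var(x, q):
    # VaR: the integrand ignores x and returns the quantile itself.
    return q

def make_phi_cvar(alpha):
    def phi(x, q):
        # CVaR: quantile plus scaled excess over the quantile.
        return q + max(x - q, 0.0) / (1.0 - alpha)
    return phi

def make_phi_blend(alpha, weight):
    """Hypothetical integrand: weight in [0, 1] interpolates VaR -> CVaR."""
    def phi(x, q):
        return q + weight * max(x - q, 0.0) / (1.0 - alpha)
    return phi

losses = [1.0, 2.0, 3.0, 4.0, 10.0]   # illustrative data
alpha = 0.5
rho_var = qbrm(losses, [alpha], phi_var)
rho_cvar = qbrm(losses, [alpha], make_phi_cvar(alpha))
rho_blend = qbrm(losses, [alpha], make_phi_blend(alpha, 0.5))
```

Passing the integrand as a function keeps the evaluator independent of the particular risk measure, which mirrors how the definition separates the quantile arguments from the choice of $\Phi$.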
3.4 Dynamic Quantile-Based Risk Measures
Notice that, so far, we have developed QBRMs in a “static” setting (the value of the risk measure is in ) for simplicity. Given a random variable and a risk level , the conditional counterpart for the quantile is
Using this new definition, we can similarly extend the definition of a QBRM to the conditional setting by replacing (3.1) with
and replacing the required properties of monotonicity, translation invariance, and positive homogeneity in Definition 1 with their conditional forms given in Ruszczyński (2010) (denoted therein by A2, A3, and A4). For the sake of notational simplicity, let us assume that all parameters, i.e., , , are static over time, but we remark that an extension to time-dependent (and even state-dependent) versions of the one-step conditional risk measure is possible. Let be a (conditional) QBRM that measures tail risk. In applications, a weighted combination of a tail risk measure with the traditional expectation ensures that the resulting policies are not driven completely by the tail behavior of the cost distribution; we may use QBRMs of the form , where .
Using one-step conditional risk measures as building blocks, we can define a dynamic risk measure to be , which we refer to as a dynamic quantile-based risk measure (DQBRM). The dynamic risk measures obtained when and (the conditional forms of VaR and CVaR) are precisely the time-consistent risk measures suggested in Cheridito and Stadje (2009) under the names composed value at risk and composed conditional value at risk.
3.5 Objective Function
We are interested in finding optimal risk-averse policies under objective functions specified using a DQBRM. The problem is
The upcoming theorem, proven in Ruszczyński (2010), gives the Bellman-like optimality equations for a risk-averse model. We state it under the assumption that the current period contribution is random, differing slightly from the original statement. A point of clarification: the original theorem is proved within the setting where the one-step risk measures satisfy conditional forms of the axioms of Artzner et al. (1999) for coherent risk measures. In our setting, however, the QBRM is only assumed to satisfy (conditional forms of) monotonicity, positive homogeneity, and translation invariance, but not necessarily convexity. The crucial step of the proof given in Ruszczyński (2010) relies only on monotonicity and an associated interchangeability property (see (Ruszczyński and Shapiro, 2006b, Theorem 7.1), (Ruszczyński, 2010, Theorem 2)). The assumption of convexity is therefore not necessary for the following theorem.
Theorem 1 (Bellman Recursion for Dynamic Risk Measures, Ruszczyński (2010)).
The sequential decision problem (3.2) has optimal value functions given by
The decision functions of an optimal policy are given by
which map to a minimizing action of the optimality equation.
For computational purposes, we are interested in interchanging the minimization operator and the risk measure and thus appeal to the state-action value function or -factor formulation of the Bellman equation. Define the state-action value function over the state-action pairs to be , for and let . Thus, the counterpart to the recursion in Theorem 1 is
with the minimization occurring inside of the risk measure.
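The following minimal sketch illustrates a backward recursion of this kind on a toy problem, with one-step CVaR as the risk measure and the minimization over next-period actions taken inside it. It assumes a recursion of the form $Q_t(s,a) = \rho_t\bigl(c_t(s,a,W) + \min_{a'} Q_{t+1}(S', a')\bigr)$, a zero terminal value, and a finite set of exogenous outcomes with known probabilities; these choices, the function names, and the example data are assumptions for illustration. It is an exact enumeration, not the simulation-based algorithm developed later in the paper.

```python
def cvar_discrete(values, probs, alpha):
    """CVaR of a finite distribution via quantile plus scaled mean excess."""
    pairs = sorted(zip(values, probs))
    cum, q = 0.0, pairs[-1][0]
    for v, p in pairs:
        cum += p
        if cum >= alpha - 1e-12:   # first value whose cdf reaches alpha
            q = v
            break
    excess = sum(p * max(v - q, 0.0) for v, p in pairs)
    return q + excess / (1.0 - alpha)

def backward_q_factors(T, feasible, outcomes, probs, transition, cost, alpha):
    """Return Q[t][(s, a)] for t = 0, ..., T-1 by backward recursion."""
    v_next = {s: 0.0 for s in feasible}      # assumed terminal value of zero
    Q = [None] * T
    for t in reversed(range(T)):
        Q[t] = {}
        for s, acts in feasible.items():
            for a in acts:
                # Future cost for each outcome: current cost plus the
                # minimized next-period Q-factor at the resulting state.
                future = [cost(t, s, a, w) + v_next[transition(s, a, w)]
                          for w in outcomes]
                Q[t][(s, a)] = cvar_discrete(future, probs, alpha)
        v_next = {s: min(Q[t][(s, a)] for a in acts)
                  for s, acts in feasible.items()}
    return Q

# Toy problem: one state, a safe action and a risky action.
feasible = {"s": ["safe", "risky"]}
outcomes, probs = ["calm", "shock"], [0.9, 0.1]

def transition(s, a, w):
    return "s"

def cost(t, s, a, w):
    if a == "safe":
        return 1.0
    return 5.0 if w == "shock" else 0.0

Q = backward_q_factors(2, feasible, outcomes, probs, transition, cost, 0.9)
```

In this toy problem the risky action has the lower expected cost (0.5 versus 1.0 per period) but a much larger CVaR at the 90% level, so the risk-averse recursion prefers the safe action at both decision epochs.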
3.6 Some Remarks on Notation
For simplicity, we henceforth refer to simply as the optimal value function. Let and . We consider to be a vector in with components . We also frequently use the notation for some , by which we mean restricted to the components for all . We adopt this system for any vector in (e.g., , , and to be defined later). The norms used in this paper are , , and , the -norm, the Euclidean norm, and the maximum norm, respectively. Example usages of the latter two are
The following naming convention is used throughout the paper and appendix: stochastic processes denoted using , i.e., , , and , are conditionally unbiased noise sequences and represent Monte Carlo sampling error. On the other hand, the processes denoted using , i.e., , , and , are biased noise and represent approximation error from using a value function approximation. For a vector , is the diagonal matrix whose entries are the components of . Lastly, for a nonnegative function , its support is represented by the notation .
In this section, we introduce the risk-averse ADP algorithm for dynamic quantile-based risk measures, which aims to approximate the value function in order to produce near-optimal policies.
4.1 Overview of the Main Idea
Like most ADP and reinforcement learning algorithms, the algorithm that we develop in this paper to solve (3.2) is based on the recursive relationship of (3.3). The basic structure for the algorithm is a time-dependent version of $Q$-learning or approximate value iteration (see (Powell, 2011, Chapter 10) for a discussion). Recall the form of the QBRM:
$$\rho(X) = \mathbf{E}\bigl[\Phi\bigl(X, q_{\alpha_1}(X), q_{\alpha_2}(X), \ldots, q_{\alpha_m}(X)\bigr)\bigr].$$
The main idea of our approach is to approximate the quantiles and then combine the approximations to form an estimate of the risk measure. In essence, every observation of the exogenous information process (real or simulated data) can be utilized to give an updated approximation of each of the quantiles. A second step then takes the observation and the quantile approximations to generate a refined approximation of the optimal value function. This type of logic is implemented using many concurrent stochastic gradient (Robbins and Monro, 1951; Kushner and Yin, 2003) steps within a framework that walks through a single forward trajectory of states and actions on each iteration.
It turns out that there is a convenient characterization of the quantile through the so-called CVaR minimization formula. Given a real-valued, integrable random variable $X$, a risk level $\alpha \in (0,1)$, and $(x)^+ = \max(x, 0)$, Rockafellar and Uryasev (2000) proves that
$$q_\alpha(X) \in \operatorname*{arg\,min}_{u \in \mathbb{R}} \; \mathbf{E}\Bigl[u + \frac{1}{1-\alpha}(X - u)^+\Bigr]. \tag{4.1}$$
Although the main result of Rockafellar and Uryasev (2000) is that the optimal value of the optimization problem gives $\operatorname{CVaR}_\alpha(X)$, the characterization of the quantile as the minimizer is particularly useful for our purposes. It suggests the use of stochastic approximation or stochastic gradient descent algorithms (Robbins and Monro, 1951; Kushner and Yin, 2003) to iteratively optimize (4.1).
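To see why a gradient method targets the quantile, the short derivation below may help. It is a sketch under the simplifying assumption that $X$ has a continuous, strictly increasing distribution function $F$; the symbols $f$, $g$, and $\gamma_n$ are notation introduced only for this illustration.

```latex
\begin{align*}
f(u) &= \mathbf{E}\Bigl[u + \tfrac{1}{1-\alpha}(X-u)^+\Bigr], \\
f'(u) &= 1 - \tfrac{1}{1-\alpha}\,\mathbf{P}(X > u)
       = \frac{F(u) - \alpha}{1-\alpha}, \\
f'(u) = 0 &\iff F(u) = \alpha \iff u = q_\alpha(X), \\
g(u, x) &= 1 - \tfrac{1}{1-\alpha}\,\mathbf{1}\{x > u\}
  \quad\text{(sample gradient, } \mathbf{E}[g(u,X)] = f'(u)\text{)}, \\
u_{n+1} &= u_n - \gamma_n\, g(u_n, X_{n+1})
  \quad\text{(stochastic gradient step with stepsize } \gamma_n\text{)}.
\end{align*}
```

The sample gradient takes only two values, $1$ and $1 - 1/(1-\alpha)$, so each update needs nothing more than a comparison of the new observation against the current quantile estimate.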
With this intuition in mind, let us move back to the context of the MDP and define the auxiliary variables , for each , to be the -quantiles of the future costs (recall that the quantiles are defined as an argument to our QBRM in Definition 1). The component at time and state is
for each . Using (3.1), this allows us to take advantage of the equation
The relationship between and is fundamental to our algorithmic approach, which keeps track of mutually dependent approximations and to the optimal values and , respectively.
4.2 The Dynamic-QBRM ADP Algorithm
Before discussing the details, we need some additional notation. Clearly, at each time , the random quantity with which we are primarily concerned (and attempt to approximate) is the future cost given the optimal value function . Thus, we explicitly define its distribution function for every :
Recall that is the cardinality of the state-action space. Next, suppose is an approximation of and for each and , define the stochastic gradient mapping to perform the stochastic gradient computation:
To avoid confusion, we note that this is the stochastic gradient associated with the minimization formula (4.1) for CVaR, but this step is necessary for any QBRM, even if we are not utilizing CVaR.
The second piece of notation we need is a specialized, stochastic version of the Bellman operator to the risk-averse case: for each , we define the mapping , with arguments in , to represent an approximation of the term within the expectation of (4.3):
Therefore, (4.3) can be rewritten using the stochastic Bellman operator by replacing all approximate quantities with their true values:
The Dynamic-QBRM ADP algorithm that we describe in the next section consists of both outer and inner iterations: for each outer iteration , we step through the entire time horizon of the problem . At time , iteration , the relevant quantities for our algorithms are a state-action pair and two samples from the distribution of corresponding to the “two steps” of our algorithm, one for approximating the auxiliary variables and the second for approximating the value function . Figure 1 illustrates the main idea behind the algorithm: we merge the results of adaptive minimizations of (4.1), corresponding to estimates of the quantiles, into an estimate of the optimal value function, . The estimate is then used to produce estimates of the relevant quantities for the previous time period. Note that the objective functions shown in the figure differ only in their risk levels . The arrows on the curves indicate that the minimizations are achieved via gradient descent steps.
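The two-step idea can be illustrated in a much simpler static, single-distribution setting. The sketch below is an analogue of that idea and not Algorithm 1 itself: it uses one sample to take a projected stochastic gradient step on the quantile estimate and a second, independent sample to smooth the CVaR integrand evaluated at that estimate. The stepsize $1/n$, the function name `two_step_cvar`, the projection interval, and the uniform test distribution are all assumptions made for this example.

```python
import random

def two_step_cvar(sample, alpha, n_iter, lo, hi, seed=0):
    """Jointly estimate the alpha-quantile (u) and CVaR (v) of a loss.

    `sample(rng)` draws one observation; [lo, hi] is the projection
    interval for the quantile iterate.
    """
    rng = random.Random(seed)
    u = 0.5 * (lo + hi)    # quantile estimate
    v = 0.0                # risk-measure (value) estimate
    for n in range(1, n_iter + 1):
        step = 1.0 / n
        x_u = sample(rng)  # first sample: quantile step
        x_v = sample(rng)  # second sample: value step

        # Step 1: projected stochastic gradient step on the CVaR
        # minimization objective; the gradient takes two values only.
        grad = 1.0 - (1.0 if x_u > u else 0.0) / (1.0 - alpha)
        u = min(max(u - step * grad, lo), hi)

        # Step 2: smooth the CVaR integrand, evaluated at the current
        # quantile estimate, into the running value estimate.
        observation = u + max(x_v - u, 0.0) / (1.0 - alpha)
        v += step * (observation - v)
    return u, v

# Uniform(0, 1) losses: the 0.9-quantile is 0.9 and CVaR at 0.9 is 0.95.
u_hat, v_hat = two_step_cvar(lambda rng: rng.random(), 0.9, 100_000, 0.0, 1.0)
```

Because the value step relies on a quantile estimate that is itself still converging, its early observations are biased. This is a static counterpart of the interdependence between the two approximations that the convergence analysis has to control.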
Now that we are in an algorithmic setting, we consider a new probability space , where . In order to describe the history of the algorithms, we define:
for and , with for all . We therefore have a filtration that obeys for and , coinciding precisely with the progression of the algorithm. The random variables are generated according to some sampling policy (to be discussed later) while and are generated from the distribution of the exogenous process .
Crucial to many ADP algorithms is the stepsize (or learning rate). In our case, we use and for smoothing new observations with previous estimates, where for each and and are -measurable. The stepsize is used to update our approximation of while the stepsize is used to update ; see Algorithm 1. We articulate the asynchronous nature of our algorithm by imposing the following condition on the stepsizes (included in Assumption 1 of Section 5):
which causes updates to only happen for states that we actually visit.
Stochastic approximation theory often requires a projection step onto a compact set (giving bounded iterates) to ensure convergence (Kushner and Yin, 2003). Hence, for each and , let and be compact intervals and let
be our projection sets at time . The Euclidean projection operator to a set is given by the usual definition:
These sets may be chosen arbitrarily large in practice and our first theoretical result (almost sure convergence) will continue to hold. However, there is a tradeoff: if, in addition, we want our convergence rate results to hold, then these sets also cannot be too large (see Assumption 3).
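When the projection set is a product of compact intervals, the Euclidean projection reduces to componentwise clipping. A minimal sketch follows, in which the function name and the example bounds are assumptions for illustration.

```python
def project_box(x, lower, upper):
    """Euclidean projection of x onto the box [lower_i, upper_i] per component."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]

# Components outside their interval are moved to the nearest endpoint;
# components already inside are left unchanged.
projected = project_box([-2.0, 0.3, 9.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Clipping each coordinate independently is exact here because the squared Euclidean distance separates across coordinates when the feasible set is a box.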
The precise steps of Dynamic-QBRM ADP are given in Algorithm 1. A main characteristic of the algorithm is that sequences and are intertwined (i.e., depend on each other). Consequently, there are multiple levels of approximation being used throughout the steps of the algorithm. The theoretical results of the subsequent sections shed light onto these issues.
5 Analysis of Convergence
In this section, we state and prove convergence theorems for Algorithm 1. First, we give an overview of our analysis and the relationship to existing work.
5.1 A Preview of Results
The two main results of this section are: (1) the almost sure convergence of Dynamic-QBRM ADP and (2) a convergence rate result under a particular sampling policy called $\varepsilon$-greedy. The proof of almost sure convergence uses techniques from the stochastic approximation literature (Kushner and Yin, 2003), which were applied to the field of reinforcement learning and $Q$-learning in Tsitsiklis (1994), Jaakkola et al. (1994) and Bertsekas and Tsitsiklis (1996). However, our algorithm differs from risk-neutral $Q$-learning in that it tracks multiple mutually dependent quantities (the quantile approximations and the value function approximation) over a finite horizon. The intuition behind the proof is that multiple “stochastic approximation instances” are pasted together in order to obtain overall convergence of all relevant quantities. Accordingly, the interdependence of various approximations means that in several parts of the proof, we require careful analysis of biased noise terms (or approximation error) in addition to unbiased statistical error. See, e.g., Kearns and Singh (1999), Even-Dar and Mansour (2004) and Azar et al. (2011), for convergence rate results for standard $Q$-learning. The proof technique used to analyze the high probability convergence rate of risk-neutral $Q$-learning in Even-Dar and Mansour (2004) is based on the same types of stochastic approximation results that we utilize in this paper.
Let us now make a few remarks regarding some simplifying assumptions made in this paper. As proven in (Rockafellar and Uryasev, 2002, Theorem 10), the set of minimizers of the CVaR minimization formula is a nonempty, closed, and bounded interval for a general random variable. We shall for ease of presentation, however, make assumptions (strictly increasing and continuous cdf, Assumption 2(iii)) to guarantee that the quantile is the unique minimizer when the random variable is the optimal future cost and that gradient computations remain valid. This assumption is sufficient for almost sure convergence (Theorem 2). To further examine the convergence rate of the algorithm (Theorem 3), we must additionally have Assumption 3, which states that the density of the future cost exists and is positive within the constraint sets; this provides us with the technical condition of strong convexity (discussed more in Section 5.3 below).
Since has a discrete distribution, the assumptions hold only in certain situations: an obvious case is when the current stage cost has a density and is independent of . For example, such a property holds when can be written as two independent components where the current stage cost depends on and the downstream state depends on . This model is relevant in a number of applications; notable examples include multi-armed bandits (Whittle, 1980), shortest path problems with random edge costs (Ryzhov and Powell, 2011), trade execution with temporary (and temporally independent) price impact (Bertsimas and Lo, 1998), and energy trading in two-settlement markets (Löhndorf et al., 2013). Small algorithmic extensions (requiring more complex notation) to handle the general case are possible, but the fundamental concepts would remain unchanged. Hence, we will assume the cleaner setting for the purposes of this paper.
5.2 Almost Sure Convergence
First, we discuss the necessary algorithmic assumptions, many of which are standard to the field of stochastic approximation.
For all and , suppose the following are satisfied:
, for some that is -measurable,
, for some that is -measurable,
, such that state sampling policy satisfies
the projection sets are chosen large enough so that for each and .
Assumptions 1(i) and (ii) represent the asynchronous nature of the algorithm, sending the stepsize to zero whenever a state is not visited, while (iii) and (iv) are standard conditions on the stepsize. Assumption 1(v) is an exploration requirement; by the Extended Borel-Cantelli Lemma (see Breiman (1992)), sampling with this exploration requirement guarantees that we will visit every state infinitely often with probability one. In particular, for the case with an $\varepsilon$-greedy sampling policy (i.e., explore with probability $\varepsilon$, follow current policy otherwise), this assumption holds. We discuss this policy in greater detail in Section 5. Part (vi) is a technical assumption. The second group of assumptions that we present are related to the problem parameters.
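A minimal sketch of an $\varepsilon$-greedy action choice in a cost-minimization setting is given below. It assumes that exploration draws uniformly from the feasible actions and that the "current policy" is greedy with respect to the current value estimates; the function name, argument layout, and example values are assumptions for this illustration.

```python
import random

def epsilon_greedy(q_values, feasible_actions, epsilon, rng):
    """Pick an action: explore with probability epsilon, else act greedily.

    `q_values` maps each feasible action to its current estimated cost.
    """
    if rng.random() < epsilon:
        # Exploration: every feasible action has positive probability.
        return rng.choice(feasible_actions)
    # Exploitation: costs are minimized, so take the smallest estimate.
    return min(feasible_actions, key=lambda a: q_values[a])

rng = random.Random(0)
q_est = {"a1": 3.0, "a2": 1.5, "a3": 2.0}   # illustrative estimates
actions = list(q_est)
greedy_choice = epsilon_greedy(q_est, actions, 0.0, rng)
explored = {epsilon_greedy(q_est, actions, 1.0, rng) for _ in range(200)}
```

Any strictly positive `epsilon` gives each feasible action a probability of at least `epsilon / len(feasible_actions)` of being selected at a visited state, which is the kind of lower bound an exploration requirement asks for.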
The following hold:
the risk-aversion function (from the QBRM within the one-step conditional risk measure ) is Lipschitz continuous with constant , i.e., for all , ,
such that for all and ,
the distribution function is strictly increasing and Lipschitz continuous with constant , i.e.,
for all , , and .
Assumption 2(ii) states that the second moment of the cost function is bounded. Assumption 2(iii) and Assumption 1(vi) together imply that is the unique such that .
We distinguish between two types of noise sequences (in ) for each , the statistical error and the approximation error, denoted by and , respectively. The definitions are
and we see that the random variable represents the error that the sample gradient deviates from its mean, computed using the true future cost distribution (i.e., assuming we have ). On the other hand, is the error between the two evaluations of given the same sample , due only to the difference between and . Rearranging, we have
which implies that the update given in Step 4 of Algorithm 1 can be rewritten as
Note that the term in the square brackets is a biased stochastic gradient and observe that it is bounded (since only takes two finite values). For the present inductive step at time , let us fix a state . It now becomes convenient for us to view as a stochastic process in , adapted to the filtration (since ). It is clear that by the definition of :
Therefore, are unbiased increments that can be referred to as martingale difference noise. Before continuing, notice the following useful fact:
The proof follows from and , where the minimum and maximum are taken over the components of some vectors and . Now, let . Expanding the definition of and using (5.2), we obtain