Robust Risk-Aware Reinforcement Learning

by Sebastian Jaimungal, et al.

We present a reinforcement learning (RL) approach for robust optimisation of risk-aware performance criteria. To allow agents to express a wide variety of risk-reward profiles, we assess the value of a policy using rank dependent expected utility (RDEU). RDEU allows the agent to seek gains, while simultaneously protecting themselves against downside events. To robustify optimal policies against model uncertainty, we assess a policy not by its distribution, but rather, by the worst possible distribution that lies within a Wasserstein ball around it. Thus, our problem formulation may be viewed as an actor choosing a policy (the outer problem), and the adversary then acting to worsen the performance of that strategy (the inner problem). We develop explicit policy gradient formulae for the inner and outer problems, and show their efficacy on three prototypical financial problems: robust portfolio allocation, optimising a benchmark, and statistical arbitrage.




1 Introduction

Many problems in financial mathematics, economics, and engineering may be cast in the form of a stochastic optimisation problem, and typically the agent’s optimal control depends on the underlying (dynamic) model assumptions. Models, however, are approximations of the world, whether they are purely data driven (i.e., empirical) models, parametric models estimated from data, or models that are posited to reflect the given stochastic dynamics. As models are approximations, understanding how to protect one’s decisions from uncertainty inherent in a model is of paramount importance. Thus, here we consider robust stochastic optimisation problems where the agent chooses the action that is optimal under the worst case within an uncertainty set.

In many contexts, and particularly so in financial modelling, it is important to account for risk. Using expected utility of rewards is one approach for trading off risk and reward; however, there are many models of decision under uncertainty that go beyond it. Here, we take the rank dependent expected utility (RDEU) framework of [Yaari1987Economitrica], which allows agents to account not only for the concavity in their utility, but also to distort the probabilities of outcomes so as to reflect their risk preferences more fully.

While only specific examples of robust stochastic optimisation problems admit (semi-) analytical solutions or are numerically tractable, a general framework for solving robust stochastic optimisation problems is still missing and that is the focus of this paper. Hence, we develop a reinforcement learning (RL) approach for solving a general class of robust stochastic optimisation problems, where agents aim to minimise their risk – measured by RDEU – subject to model uncertainty; thus robustifying their actions.

In our setting, an agent’s action induces a univariate controlled random variable (rv) which is subject to distributional uncertainty, modelled via the Wasserstein distance. Notably, while the uncertainty is on the controlled rv and the alternative distributions lie within a Wasserstein ball around it, the alternative distributions may also be subject to other structural constraints.

The vast majority of the literature on RL considers maximising expected total reward, while in many contexts, and particularly so in financial modelling, it is important to account for risk. While risk-aware RL often focuses on expected utility, recently [tamar2015policy] extended policy gradient methods to account for coherent measures of risk. Here, however, we are interested in RDEU measures of risk that fall outside the class of coherent risk measures.

A (risk-neutral) distributionally robust RL approach for Markov decision processes, where robustness is induced by looking at all transition probabilities (from a given state) that have relative entropy with respect to (wrt) a reference probability less than a given epsilon, is developed in [smirnova2019distributionally]. [abdullah2019wasserstein] develops a (risk-neutral) robust RL paradigm where policies are randomised with a distribution that depends on the current state, see [wang2020reinforcement] for a continuous time version of randomised policies with entropy regularisation and [guo2020entropy, firoozi2020exploratory] for its generalisation to mean-field game settings. In [abdullah2019wasserstein], the uncertainty is placed on the conditional transition probability from old state and action to new state, and the set of distributions are those that lie within an “average” -Wasserstein ball around a benchmark model’s distribution. As randomised policies are used, the constraint and policy decouple. In this work, there is no such decoupling and we work mostly with deterministic policies that map states to actions in a unique manner.

As far as the authors are aware, this paper fills two gaps in the literature. The first is the incorporation of RDEU measures of risk to RL problems, and the second is robustifying risk-aware RL. We fill these gaps by posing a generic robust risk-aware optimisation problem, developing policy gradient formulae for numerically solving it, and illustrating its tractability on three prototypical examples in financial mathematics.

The remainder of this paper is structured as follows: Section 2 introduces the robust stochastic optimisation problem. Section 3 provides the RL policy gradient formulae for the inner and outer problems. Section 4 illustrates the tractability of the RL framework on three examples: robust portfolio allocation, optimising a benchmark, and statistical arbitrage.

2 Robust Optimisation Problems

We consider agents who measure the risk/reward of a rv using Yaari’s dual theory [Yaari1987Economitrica]. As Yaari argues, agents not only value outcomes according to a utility function, but also view probabilities of outcomes subjectively and thus distort them. This leads to the notion of rank dependent expected utility (RDEU) defined below.

Definition 1 (RDEU)

The RDEU of a rv may be defined via a Choquet integral as


where is an increasing function with and , called distortion function, and is a non-decreasing concave utility function. We assume that is differentiable almost everywhere.

The above definition assumes that positive outcomes correspond to gains and negative ones to losses. The RDEU framework subsumes the class of distortion risk measures, for , which includes the well-known Conditional-Value-at-Risk (CVaR), see Section 4. Moreover, it includes the expected utility framework when , in which case . Throughout, we refer to the RDEU of a rv as the rv’s risk.
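As a rough illustration of how an RDEU may be evaluated on simulated outcomes, the following sketch assumes one common convention: the distortion g acts on survival probabilities, and the RDEU, viewed as a risk, equals minus the distorted expectation of the utility. Under that assumption, the i-th smallest of N utilities receives the weight g(1 - (i-1)/N) - g(1 - i/N). The function names, the CVaR-type distortion, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence


def empirical_rdeu(
    samples: Sequence[float],
    distortion: Callable[[float], float],
    utility: Callable[[float], float] = lambda x: x,
) -> float:
    """Estimate the RDEU (as a risk, lower is better) from samples.

    Assumed convention: the distortion acts on survival probabilities and
    the risk is minus the distorted expectation of the utility.
    """
    u = sorted(utility(y) for y in samples)  # utilities in ascending order
    n = len(u)
    total = 0.0
    for i, value in enumerate(u, start=1):
        # Distorted probability weight of the i-th smallest outcome.
        weight = distortion(1.0 - (i - 1) / n) - distortion(1.0 - i / n)
        total += value * weight
    return -total


def identity_distortion(x: float) -> float:
    """g(x) = x recovers minus the expected utility."""
    return x


def cvar_distortion(alpha: float) -> Callable[[float], float]:
    """Distortion that puts all weight on the worst alpha-fraction of outcomes."""
    return lambda x: max(0.0, (x - (1.0 - alpha)) / alpha)


# Hypothetical profit-and-loss outcomes (positive values are gains).
pnl = [-3.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
risk_neutral = empirical_rdeu(pnl, identity_distortion)
cvar_20 = empirical_rdeu(pnl, cvar_distortion(0.2))
```

With a linear utility, the identity distortion returns minus the sample mean, while the CVaR-type distortion at level 0.2 returns minus the average of the two worst of the ten outcomes.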

We consider the situation where an agent’s action induces a rv and that the agent aims to minimise the risk associated with , i.e. . However, due to the presence of model uncertainty – distributional uncertainty on – the agent, instead of choosing actions with minimal risk, chooses the action that minimises the worst-case risk of all alternative rvs , where belongs to an uncertainty set , which may depend on the agent’s action . Specifically, the agent aims to solve the robust optimisation problem


where the admissible set of controls , is a controlled -valued rv, is an -valued rv parametrised by , parametrises the robustness set, and denotes the -Wasserstein distance between two rvs and , defined below. Problem (P) is only fully specified once the mappings and are given. The proposed RL approach allows for flexibility in these mappings; thus, apart from the existence of a solution to (P), we make no further assumptions on them. We may interpret (P) as an adversarial attack, where the agent picks an action, and an adversary distorts to have the worst possible performance within a given Wasserstein ball. Below we provide several examples of problem (P), which we revisit in Section 4.

Recall that the Wasserstein distance of order between and , is given explicitly by (see e.g., [ambrosio2003lecture], Chap. 1)


where is the set of all bivariate probability measures with marginals and . The -Wasserstein distance defines a metric on the space of probability measures.
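For univariate rvs, the optimal coupling pairs quantiles with quantiles, so the Wasserstein distance between two samples of equal size may be approximated by sorting both and comparing them element by element. The sketch below assumes equal sample sizes and an order p of at least one; the function name is hypothetical.

```python
from typing import Sequence


def wasserstein_1d(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Empirical p-Wasserstein distance between two equal-size 1-D samples."""
    if len(x) != len(y) or not x:
        raise ValueError("samples must be non-empty and of equal size")
    xs, ys = sorted(x), sorted(y)  # sorting pairs empirical quantiles
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)
```

For instance, shifting every sample point by a constant c yields a distance of |c| for any order p, and permuting a sample leaves the distance unchanged.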

The robust stochastic optimisation problem (P) is a generalisation of distributionally robust optimisation, where the uncertainty set is a subset of the space of distribution functions only, see e.g., [esfahani2018data, Bernard2020robust]. Here, however, the uncertainty set possesses additional features in that (a) it may depend on the agent’s action , (b) the rv may have a structure induced by , in which case not all rvs within a Wasserstein distance around are feasible, and (c) the feasible parameters belong to a set , which may impose additional constraints on .

Problem (P) performs a robust optimisation (over ) of as follows. Given from the “outer” problem, the “inner” problem corresponds to a robust version of ’s risk. As , the inner problem reduces to the RDEU of . When , however, the agent incorporates model uncertainty, and instead assesses the risk associated with by seeking over all alternate rvs, generated by , that lie within a Wasserstein ball around it.

Example 1 (Robust Portfolio Allocation.)

Suppose that is the probability simplex in -dimensions, for we write , and represents the returns of traded assets. Further, let and write , where is an artificial neural net (ANN) parameterised by . In this setup, the inner problem corresponds to seeking over all distribution functions that may be generated by the ANN and that lie within a Wasserstein ball around ; thus, the inner problem results in a robust estimate of the risk of . The outer problem then seeks to find the best investment that is robust to model uncertainty. [pflug2007ambiguity, esfahani2018data] investigate a similar class of problems; however, there the uncertainty ball is on the inputs and not the output, and they use coherent/convex risk measures rather than RDEU as measures of risk.

Example 2 (Optimising Risk-Measures with a Benchmark.)

Suppose that is a singleton, the components of denote the percentage of wealth to invest in various assets, and denotes the terminal value of such an investment. Then, may be interpreted as a benchmark strategy that the investor wishes to outperform in terms of RDEU.

Let parameterise a dynamic self-financing trading strategy (e.g., parameters in an ANN that map time and asset prices to trading positions) whose terminal value is . If we replace the sup in the inner problem in (P) with an inf, the corresponding problem is to find a dynamic strategy that has the best risk of all portfolios within a Wasserstein ball around the benchmark. This example generalises [pesenti2020portfolio] to the case of RDEU and also applies to incomplete markets.

Example 3 (Robust Dynamic Trading Strategy.)

Consider the case where denotes the price path of an asset at (trading) time points , and denotes the shares bought/sold at the sequence of trading times. For any , the terminal wealth from the sequence of trades is


where is the total assets held at time . Further, we set , where is an ANN parametrised by . As in Example 1 this corresponds to an agent who aims to minimise over a robust measure of risk of . A related work is [cartea2017algorithmic] who consider robust algorithmic trading problems using relative entropy penalisations under linear utility.

In the next section we derive the policy gradient formulae for the inner and outer problem.

3 Policy Gradients

Policy gradient methods provide a sequence of policies/actions that improve upon one another by taking steps in the direction of the value function’s gradient, where the gradient is taken wrt the parameters of the policy. In this section, we derive policy gradient update rules for optimising (P) over both and . In Section 3.3, we provide a policy gradient formula when the agent controls not the action itself, but rather its distribution. Such actions are also referred to as relaxed controls, see [wang2020reinforcement, firoozi2020exploratory, guo2020entropy].

3.1 The Inner Problem

First, we study the inner problem of (P). To do so, we employ an augmented Lagrangian approach to incorporate the constraints. For this, we fix the rv and denote by and its corresponding distribution and quantile function, respectively. We further denote the distribution and quantile function of by and , respectively. The augmented Lagrangian may then be written as


where is the -Wasserstein constraint error, denotes the positive part of , is the Lagrange multiplier that enforces this constraint, and the penalty constraint. The augmented Lagrangian approach fixes and , minimises/maximises , e.g., by using stochastic gradient descent (SGD), then updates and with some . For an overview of the augmented Lagrangian approach see, e.g., [birgin2014practical], Chap. 4.
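The alternating scheme can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a standard form of the method, in which the objective is augmented by lam * c + (mu / 2) * c^2, with c the positive part of the constraint violation. After each inner optimisation, the multiplier is increased by mu * c and the penalty is multiplied by a growth factor greater than one. The toy objective, the step size, and all numerical values are hypothetical.

```python
from typing import Tuple


def augmented_lagrangian_demo(
    outer_iters: int = 8, inner_steps: int = 200, growth: float = 2.0
) -> Tuple[float, float]:
    """Minimise (x - 3)^2 subject to x <= 1 with an augmented Lagrangian.

    Toy stand-in: c(x) = max(x - 1, 0) plays the role of the constraint error.
    """
    x, lam, mu = 0.0, 0.0, 1.0
    for _ in range(outer_iters):
        step = 1.0 / (2.0 + mu)  # stable step size for this quadratic toy problem
        for _ in range(inner_steps):
            c = max(x - 1.0, 0.0)
            active = 1.0 if x > 1.0 else 0.0
            # Gradient of (x - 3)^2 + lam * c + (mu / 2) * c^2 with respect to x.
            grad = 2.0 * (x - 3.0) + (lam + mu * c) * active
            x -= step * grad
        c = max(x - 1.0, 0.0)
        lam += mu * c  # multiplier update with the remaining violation
        mu *= growth   # tighten the penalty for the next round
    return x, lam


x_opt, lam_opt = augmented_lagrangian_demo()
```

In this toy problem the constrained minimiser is x = 1 and the first-order conditions give a multiplier of 4, which the iterates approach as the penalty grows.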

While the augmented Lagrangian may be estimated from a mini-batch of simulations, optimising over the parameters requires gradients wrt . Many widely used risk measures, such as CVaR, RVaR, UTE, however, admit a derivative (of ) that has discontinuities, and whenever the derivative of has discontinuities, naïve back-propagation will incorrectly estimate its gradient. To overcome these potential discontinuities, we derive a gradient formula that can be estimated using mini-batch samples.

Proposition 1 (Inner Gradient Formula.)

Let denote the version of that makes comonotonic – i.e., reorder the realisations of according to the rank of . If is left-differentiable, then


where is given by , denotes the derivative from the left, the constant , and is the density of .

The gradient formula (5) requires estimating the function . For this purpose, suppose we are given a mini-batch of data of , which, e.g., may be the result of an accumulation of multiple sources of randomness (such as in dynamic trading). We then make a kernel density estimator (KDE) of given by


where denotes the distribution function for an appropriate (zero-centred and standardised) kernel (e.g., Gaussian). Therefore,


where is the kernel’s corresponding density. As the samples are viewed as outputs of an ANN, the gradients may be efficiently obtained using standard back-propagation techniques.
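A kernel-smoothed distribution function and its derivative can be sketched as follows, assuming a Gaussian kernel and a fixed bandwidth. The bandwidth argument and the function names are hypothetical, and no automatic differentiation is used here; the sketch only shows the estimator in the sample point x.

```python
import math
from typing import Sequence


def kde_cdf(x: float, samples: Sequence[float], bandwidth: float) -> float:
    """Gaussian-kernel estimate of the distribution function at x."""
    return sum(
        0.5 * (1.0 + math.erf((x - s) / (bandwidth * math.sqrt(2.0))))
        for s in samples
    ) / len(samples)


def kde_pdf(x: float, samples: Sequence[float], bandwidth: float) -> float:
    """Derivative of kde_cdf with respect to x (the usual Gaussian KDE)."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
```

Because the smoothed distribution function is differentiable in both its argument and the sample points, it avoids the jumps of the raw empirical distribution function.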

Inserting the KDE into (5), we may estimate the gradient by


where are the reordered realisations of , such that they are comonotonic with . The Wasserstein distance between and may be approximated using the same mini-batch as , see e.g., [ambrosio2003lecture][Chapter 1.].
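The reordering step that makes one sample comonotonic with another can be sketched as below: the k-th smallest value of one sample is placed at the position of the k-th smallest value of the reference sample. The function name is hypothetical, and ties in the reference sample are broken by position.

```python
from typing import List, Sequence


def reorder_comonotonic(reference: Sequence[float], values: Sequence[float]) -> List[float]:
    """Reorder `values` so that its ranks match the ranks of `reference`."""
    if len(reference) != len(values):
        raise ValueError("samples must have equal size")
    # Positions of the reference sample, from its smallest to its largest entry.
    order = sorted(range(len(reference)), key=lambda i: reference[i])
    sorted_values = sorted(values)
    out = [0.0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = sorted_values[rank]  # k-th smallest value meets k-th smallest reference
    return out
```

For example, `reorder_comonotonic([0.3, -1.0, 2.0], [10.0, 30.0, 20.0])` returns `[20.0, 10.0, 30.0]`, so that the smallest reference value is paired with the smallest of the reordered values.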

3.2 The Outer Problem

Similar to the inner problem, optimisation for the outer problem is carried out using the augmented Lagrangian, this time taking gradients wrt . To calculate the derivatives, we must specify how is generated. Specifically, we assume where is another (multi-dimensional) source of randomness.

Proposition 2 (Outer Gradient Formula.)

Let denote the version of which makes comonotonic – i.e., reorder the realisations of according to the rank of – then the gradient becomes


where the constant and is the density of .

As in the previous section, given a mini-batch of , we may estimate the gradient by


where we use for simplicity the same kernel for and . The gradients and may be computed using the relationship and back-propagation. Algorithm 1 provides an overview of the optimisation methodology.

3.3 Randomised Policies

Figure 1: Graphical model representation of randomised policies.

There are many instances where optimising over randomised (also known as probabilistic) policies allows one to explore the state-action space better (and, hence, obtain better model estimates) or where randomised policies are all that is viable. As such, we briefly discuss how the results in Propositions 1 and 2 may be applied in the randomised policy case. For example, it is often the case that the terminal rv (of the outer problem) stems from a sequence of actions that are conditionally generated from the previous system states , where , as in the graphical model in Figure 1. Hence, the probability density function (pdf) over the sequence of state/action pairs admits the decomposition


where specifies the conditional one-step transition densities, is the prior on , and the pdf of actions conditioned on states. To compute the gradient we use the above decomposition and note that


Therefore, its gradient becomes

In the last line, should be understood as rvs corresponding to the outputs of all nodes in the graphical model in Figure 1. Thus, using a KDE approximation from samples of state-action sequences , we may estimate the gradient by


To obtain a more explicit form, one must specify how actions are drawn using a given policy, e.g., they may be normally distributed with mean and standard deviation parameterised by an ANN. The remaining gradient may then be computed using back-propagation along the sampled mini-batch of paths. Similar calculations can be performed to derive formulae for ; note that does not have a gradient wrt actions.

initialise networks ; initialise Lagrangian multipliers , , ;
for  to  do
    Simulate mini-batch of ;
    for  to  do
        Simulate mini-batch of using fixed in outer loop;
        Estimate inner gradient using (8);
        Update network using an ADAM step;
        if  then
            Update multipliers: and ;
        end if
        Repeat until , and has not increased for the past 100 iterations;
    end for
    Simulate mini-batch of from and trained network;
    Estimate outer gradient using (10);
    Update network using an ADAM step;
    Repeat until has not decreased for the past 100 iterations;
end for
Algorithm 1 Schematic of optimisation algorithm.

4 Examples

Here, we illustrate the three prototypical examples described earlier. For this, the investor’s RDEU is a combination of a linear utility and an - distortion given by:


with normalising constant , , and . This parametric family is -shaped (i.e., -shaped RDEU), which is well known to account for investors who are simultaneously loss-avoiding and risk-seeking, and contains several notable risk measures as special cases. For , it reduces to the CVaR at level , for , to the upper tail expectation (UTE) at level , and for (), it emphasises losses (gains) relative to gains (losses). For all experiments, unless otherwise stated, we use , , and to showcase how investors protect themselves from downside risk while still seeking gains.

In the examples below, before computing the outer gradient, we ensure that constraints of the inner problem are binding, so that in (9). Furthermore, while it is easy to incorporate transaction costs in all of these examples, we opted to exclude them for simplicity of the settings. Example code may be found at

4.1 Robust Portfolio Allocation

In this subsection, we illustrate the results on a problem introduced in Example 1. We take the setup from [esfahani2018data] where the market consists of -assets whose returns are driven by a systematic factor and idiosyncratic factors , , where the factors are mutually independent. The individual returns are and the total return is .

Figure 2: Robust portfolio allocations as the size of the Wasserstein ball varies. (Left) densities of terminal wealth, and (right) percentage of wealth held in each asset. Dashed vertical lines: and .

Such a model can easily be generalised to include several systematic factors. We model the outer strategy as an ANN that maps a zero tensor directly to the asset weights through a softmax activation function (to avoid short-selling: , , ) of the learned bias. We use the Wasserstein distance of order .
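The softmax parametrisation of long-only weights can be sketched as follows. The function name and the example bias vector are hypothetical; in the setting above the bias would be the learned parameter of the network.

```python
import math
from typing import List, Sequence


def softmax_weights(bias: Sequence[float]) -> List[float]:
    """Map unconstrained parameters to non-negative weights that sum to one."""
    shift = max(bias)  # subtract the maximum for numerical stability
    exps = [math.exp(b - shift) for b in bias]
    total = sum(exps)
    return [e / total for e in exps]


# A zero bias corresponds to the equally weighted portfolio.
weights = softmax_weights([0.0, 0.0, 0.0])
```

Any real-valued bias vector is mapped into the probability simplex, so the no-short-selling and full-investment constraints hold by construction and need not be imposed during the optimisation.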

Figure 2 illustrates the optimal terminal wealth (left panel) and percentage of investment as a function of the size of the Wasserstein ball (right panel), for assets. For larger , the investor seeks more robustness, which is illustrated in the left panel that shows that all () move to the left as increases. Specifically, for equal to ,, and , we have , , and and (, , ). These statistics indicate that the investor becomes more and more conservative with increasing Wasserstein distance. This is further emphasised in the right panel, where, for small , the investor puts most of their wealth in the riskier assets, and as increases, moves closer to an equally weighted portfolio. For completeness, the () of the worst-case distribution around the optimal , as varies from ,, are , , and (, , ), respectively.

4.2 Optimising Risk-Measures with a Benchmark

Next, we illustrate our RL approach on a portfolio allocation problem where an agent aims to improve upon a benchmark strategy, as described in Example 2. The outer problem has a singleton corresponding to a benchmark strategy – which we take to be a constant proportion of wealth strategy (any other benchmark strategy would do) – and the agent aims to seek over alternate strategies that minimise RDEU, i.e. replacing sup with inf in the inner problem.

We optimise in discrete time and consider an investor who chooses from the set of admissible strategies consisting of -adapted Markov processes that are self-financing and in . For an arbitrary , where , , represents the percentage of wealth invested in each asset, the investor’s wealth process satisfies the usual self-financing equation. To illustrate the flexibility of our formulation, we use a stochastic interest rate model combined with a constant elasticity of variance (SIR-CEV) market model. The details of the market model dynamics and parameters may be found in [pesenti2020portfolio]. Here, we use the Wasserstein distance of order .

We use a fully connected feed-forward neural network with 3 hidden layers of 50 neurons each with ReLU activation functions to map the state consisting of the features to policy . The final output layer may be chosen to reflect constraints on the portfolio weights – e.g., long only weights with no leveraging, in which case one would use a sigmoid activation function. For the results shown here we have no activation for the final layer and thus allow the agent to take long, short, and leveraged positions.

Figure 3: Illustrations of the terminal wealth rv of the benchmark’s and of the optimal’s . Vertical lines indicate the and .

Figure 3 illustrates the resulting optimal portfolio for a constant proportion benchmark strategy . The left panel shows the density of the benchmark and optimal terminal wealth. The optimal pulls mass from the left tails into a spike near and pushes mass into the right tail. This reflects the investor’s risk preferences. The right scatter plot shows the state-by-state dependence between the optimal and the benchmark. The results are qualitatively similar to those derived in [pesenti2020portfolio], even though here the problem is posed in discrete time. While the analytical approach in [pesenti2020portfolio] applies only for complete market models, the proposed RL approach is also applicable to incomplete markets.

4.3 Robust Statistical Arbitrage

In this subsection, we explore Example 3. In this case, the outer strategy denotes the position the trader holds at discrete times . We assume the asset price process is an Ornstein-Uhlenbeck process satisfying with , , , , and . The outer strategy is determined by a fully connected feed-forward neural network with 3 hidden layers of 50 neurons each with ReLU activation functions, and takes the state consisting of the features as input. The final output layer is chosen to reflect the constraints on the inventory. Here, we use a activation function to constrain the strategy such that inventory remains in the interval . We use the Wasserstein distance of order .
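Price paths for such a mean-reverting model can be simulated on a discrete time grid. The sketch below assumes the standard Ornstein-Uhlenbeck dynamics dS = kappa (theta - S) dt + sigma dW and uses its exact one-step transition. The parameter names and the numerical values are hypothetical and are not the ones used in the experiments.

```python
import math
import random
from typing import List


def simulate_ou(
    s0: float, kappa: float, theta: float, sigma: float,
    dt: float, n_steps: int, rng: random.Random,
) -> List[float]:
    """Simulate an OU path dS = kappa (theta - S) dt + sigma dW on a time grid."""
    decay = math.exp(-kappa * dt)
    # Exact conditional standard deviation over one step of length dt.
    std = sigma * math.sqrt((1.0 - decay ** 2) / (2.0 * kappa))
    path = [s0]
    for _ in range(n_steps):
        path.append(theta + (path[-1] - theta) * decay + std * rng.gauss(0.0, 1.0))
    return path


# Hypothetical parameter values, chosen only for illustration.
path = simulate_ou(s0=1.2, kappa=5.0, theta=1.0, sigma=0.2, dt=0.1,
                   n_steps=10, rng=random.Random(0))
```

Using the exact transition rather than an Euler step avoids discretisation bias on coarse trading grids; a mini-batch is obtained by calling the function repeatedly with the same generator.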

Figure 4: Optimal statistical arbitrage strategy’s terminal densities (left) and corresponding worst-case densities (right). Vertical dashed lines show the corresponding .

Figure 4 shows results for the - risk measure for (), , and . The left panel shows the optimal densities as varies and the right panel the corresponding worst-case densities that result from solving the inner problem with the optimal as the reference distribution. For increasing , the agent puts more and more weight on the upper tail and the optimal distribution becomes more profitable, but also more risky (the moves to the left). In Figure 5, we illustrate the optimal execution strategies at time through a heat map and as a function of current inventory and asset price. The colours indicate the optimal trade – e.g., a location in deep red indicates short selling of units of the asset. As decreases, the agent becomes more gain-seeking and starts taking more aggressive actions to take advantage of the mean-reversion of asset prices.

Figure 5: Optimal statistical arbitrage strategy at for various values of of the - risk measure.

5 Conclusions and Future Work

We pose a generic robust risk-aware optimisation problem, develop a policy gradient approach for numerically solving it, and illustrate its tractability on three prototypical examples in financial mathematics. While the approach appears to work well on a collection of different problems, there are several avenues still open for investigation, such as the conditions on the RDEU, the controlled rv , and the uncertainty set under which the problem is well-posed, as well as establishing the convergence of the policy gradient method itself. We believe that the generality of our proposed RL framework opens the door to solving a host of other problems, including, e.g., robust hedging of derivative contracts and robustifying optimal timing of irreversible investments. One further issue worth highlighting is that, while the approach applies to dynamic decision making (such as in Examples 2 and 3), RDEU is not a dynamically time-consistent risk measure, so the optimal strategies may not be time consistent. Hence, there is also a need for developing an RL approach for robustifying time-consistent dynamic risk measures.


Appendix A Proofs

Proof (Proof of Proposition 1)

First, we prove the following lemma.

Lemma 1 (Representation of RDEU)

If the distortion function is left-differentiable, then the RDEU admits representation


where is given by , and denotes the derivative from the left.

Proof (Proof of Lemma 1)

Using first integration by parts, then the properties of , we obtain


Next, noting that and that by assumption is left-differentiable, we obtain


where we used the change of variable .

Next, as the cost functional for the -Wasserstein distance in one-dimension is submodular, we may write [ambrosio2003lecture][Chapter 1.]


Using Lemma 1, we may represent the gradient of the augmented Lagrangian as


where . As , taking gradients wrt provides us with


Further, if , then . The result follows immediately on substituting these expressions and interpreting the integral over as expectation over a uniform rv.

Proof (Proof of Proposition 2)

The proof follows along the same lines as Proposition 1, and uses the envelope theorem [milgrom2002envelope] to evaluate the gradient wrt as gradient of the Lagrangian evaluated at the saddle point obtained by the inner problem, and is omitted for brevity.