 # Martingale Functional Control variates via Deep Learning

We propose black-box-type control variate for Monte Carlo simulations by leveraging the Martingale Representation Theorem and artificial neural networks. We developed several learning algorithms for finding martingale control variate functionals both for the Markovian and non-Markovian setting. The proposed algorithms guarantee convergence to the true solution independently of the quality of the deep learning approximation of the control variate functional. We believe that this is important as the current theory of deep learning functions approximations lacks theoretical foundation. However the quality of the deep learning functional approximation determines the level of benefit of the control variate. The methods are empirically shown to work for high-dimensional problems. We provide diagnostics that shed light on appropriate network architectures.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Control variate is one of the most powerful variance reduction techniques for Monte Carlo simulation. While a good control variate can reduce the computational cost of Monte Carlo computation by several orders of magnitude, it relies on judiciously chosen control variate functions that are problem specific. For example, when computing price of basket options a sound strategy is to choose control variates to be call options written on each of the stocks in the basket, since in many models these are priced by closed-form formulae. In this article, we are interested in black-box-type control variate approach by leveraging the Martingale Representation Theorem and artificial neural networks. The idea of using Martingale Representation to obtain control variates goes back at least to

. It has been further studied in combination with regression in  and .

This article uses the expressive power of deep neural networks, see [12, 23] by setting up the search for the martingale control variate as a learning task. Using the deep neural network as a control variate introduces no bias in the Monte Carlo computation. Importantly, this holds irrespective of quality of the deep learning approximation of the control variate functional. We believe that this is important as the current theory of deep learning functions approximations lacks theoretical foundation. More precisely:

1. Non-asymptotic analysis of deep neural network function approximation is in its infancy. Error estimates in terms of a number of layers and number of neurons in each layer are only known in very specific cases, see Grohs et al.

. In general, such results are extremely difficult to obtain, see [20, Chapter 16]. The results in  show that there is a good approximation to the Black–Scholes PDE solution with number of parameters in the artificial neural network growing only polynomially with dimension and accuracy. However, there is no method of finding the appropriate parameters for the desired accuracy. The search for the optimal parameters is a non-convex optimisation problem, usually tackled using gradient descent-type algorithm.

2. There is little theoretical underpinning for the use of stochastic gradient algorithms for minimisation of non-convex functions. Furthermore stochastic gradient algorithms, such as the ADAM method , that are commonly used for training, lack theoretical foundation even in the convex case.

3. Stability analysis for the weights in the network approximation is not yet developed. In fact, there are empirical examples that demonstrate that small perturbation in training data can lead to a dramatically different output of neural network approximation [32, 24].

What we propose in this work is a method for harnessing the power of deep learning algorithms in a way that is robust even in the case some deep learning algorithm will not deliver the expected quality.

We developed several learning algorithms for finding martingale control variate functionals. These work both in the Markovian and non-Markovian setting. For the former, see Section 4 and Algorithms 1, 2. Algorithm 1 was inspired by Cvitanic et. al.  and Weinan et. al, Han et. al. [38, 21]. For the non-Markovian setting see Section 5 and Algorithms 3 and 4). Section 6.2 describes exactly the network architecture and implementation details. We empirically test these methods on relevant examples including a 100 dimensional option pricing problems, see Examples 6.2 and 6.4. We carefully measure the training cost and report the variance reduction achieved. See Section 6 for details. Since we work in situation where the function approximated by neural network can be obtained via other methods (Monte-Carlo, PDE solution) we are able to test how the expressive power of fully connected artificial neural networks depends on the number of layers and neurons per layer. See Section 6.5 for details.

We would like to draw the reader’s attention to Figure 1. We see that the various algorithms work similarly well in this case (not taking training cost into account). We note that the variance reduction is close to the theoretical maximum which is restricted by time discretisation. Finally we see that the variance reduction is still significant even when the neural network was trained with different model parameter (in our case volatility in the option pricing example).

From the results in this article we conclude that the artificial neural networks can be used to provide efficient control variates. However, we observed that all the algorithms are sensitive to the network architecture, parameters and distribution of training data. A fair amount of tuning is required to obtain good results. Based on this we believe that there is great potential in combining artificial neural networks with already developed and well understood probabilistic computational methods. Figure 1. Variance reduction achieved by network trained with σ=0.3 but then applied in situations where σ∈[0.2,0.4]. We can see that the significant variance reduction is achieved by a neural network that was trained with “incorrect” σ. Note that the “MC + CV Margbrabe” displays the optimal variance reduction that can be achieved by using exact solution to the problem. The variance reduction is not infinite even in this case since stochastic integrals are approximated by Riemann sums.

## 2. Martingale control variate

Let

be a probability space and consider an

-valued Wiener process . We will use to denote the filtration generated by . Consider an -valued, continuous, stochastic process that is adapted to . We will use to denote the filtration generated by .

Let be a measurable function. We shall consider contingent claims of the form . This means that we can consider path-dependent derivatives. Finally, we assume that there are (stochastic) discount factor given by for an appropriate function and let

 ΞT:=D(t,T)G((Xs)s∈[0,T])∈L2(FT).

We now interpret as some risk-neutral measure and so the -price of our contingent claim is

 Vt=E[ΞT∣∣Ft]=E[D(t,T)G((Xs)s∈[t,T])∣∣∣Ft]. (1)

Say we have iid r.v.s with the same distribution as . Then the standard Monte-Carlo estimator is

 VNt:=1NN∑i=1ΞiT.

Convergence in probability as

is granted by the Law of Large Numbers. Moreover the classical Central Limit Theorem tells that

 P(Vt∈[VNt−zα/2σ√N,VNt+zα/2σ√N])→1−αasN→∞,

where and is such that with

being distribution function (cumulative distribution function) of the standard normal distribution. To decrease the width of the confidence intervals one can increase

, but this also increases the computational cost. A better strategy is to reduce variance by finding an alternative Monte Carlo estimator, say , such that

 E[VN|Ft]=VtandVar[VN|Ft]

and the cost of computing is similar to .

Martingale Representation Theorem, see e.g. [10, Th. 14.5.1], provides a generic (in a sense that it does not rely on a specific model) strategy for finding a Monte Carlo estimator with the above stated properties. Recall that by assumption is measurable and . Hence there exists a unique process adapted to with such that

 ΞT=E[ΞT∣∣FW0]+∫T0ZsdWs. (2)

Observe that in our setup, , for . Hence tower property of the conditional expectation implies that

 E[ΞT∣∣Ft]=E[ΞT∣∣FW0]+∫t0ZsdWs. (3)

Consequently (2) and (3) imply

 E[ΞT∣∣Ft]=ΞT−∫TtZsdWs.

We then observe that

 Vt=E[ΞT∣∣Ft]=E[ΞT−∫TtZsdWs∣∣∣Ft].

If we can generate iid and with the same distributions as and respectively then we can consider the following Monte-Carlo estimator:

 VNt:=1NN∑i=1(ΞiT−∫TtZisdWis).

The estimator has the following properties:

 E[VNt|Ft]=VtandVar[Vt|Ft]=0. (4)

Of course this on its own is of little practical use as, in general, there is no method to obtain the unique process .

In the remainder of the article we will devise and test several strategies, based on deep learning, to find a suitable approximation for the process by , , i.e one network for each . We will only require that are measurable and square integrable i.e. . The crucial feature of our approach is that

 Vθ,λ,Nt:=1NN∑i=1(ΞiT−λ∫Tt(Zθ)isdWis),

still has the property that , albeit the resulting variance typically will not be zero anymore. Note that is a parameter that can be chosen to reduce variance.

## 3. Artificial neural networks

We fix a locally Lipschitz function and for define as the function given, for by . We fix (the number of layers), , (the size of input to layer ) and (the size of the network output). A fully connected artificial neural network is then given by , where, for , we have real matrices and real

dimensional vectors

.

The artificial neural network defines a function given recursively, for , by

 (RΦ)(x0)=WLxL−1+BL,xk=Alk(Wkxk−1+Bk),k=1,…,L−1.

We can also define the function which counts the number of parameters as

 P(Φ)=L∑i=1(lk−1lk+lk).

We will call such class of fully connected artificial neural networks

. Note that since the activation functions and architecture are fixed the learning task entails finding the optimal

.

## 4. Learning the PDE solution

Under some simplifying assumptions there is a representation for the unique given by the martingale representation theorem in terms of solution of an associated PDE. We explore this connection here and then propose several learning algorithms that are designed to learn the solution of the associated PDE.

We will assume that is a Markov process that is given as the solution to

 dXs=b(s,Xs)ds+σ(s,Xs)dWst∈[t,T],Xt=x (5)

and that there is a function such that so that

 v(t,x):=E[D(t,T)g(XT)∣∣∣Xt=x].

### 4.1. PDE derivation of the control variate

It can be shown that under suitable assumptions on , , and that . See e.g. . Let . Then, from Feynman–Kac formula, we get

 {∂tv+tr(a∂2xv)+b∂xv−cv=0in [0,T)×D\,,v(T,⋅)=gon D. (6)

Since and since satisfies the above PDE, if we apply Itô’s formula then we obtain

 D(t,T)v(T,XT)=v(t,xt)+∫TtD(t,s)∂xv(s,Xs)σ(s,Xs)dWs. (7)

Hence Feynman-Kac representation together with the fact that yields

 EQ[D(t,T)g(XT)∣∣∣Ft]=D(t,T)g(XT)−∫TtD(t,s)∂xv(s,Xs)σ(s,Xs)dWs. (8)

This shows that we have an explicit form of the process in (2), provided that

 sups∈[t,T]E[|D(t,s)∂xv(s,Xs)σ(s,Xs)|2]<∞.

Thus we can consider the Monte-Carlo estimator.

 VN,vt:=1NN∑i=1{Di(t,T)g(XiT)−∫TtDi(t,s)∂xv(s,Xis)σ(s,Xis)dWis}.

To obtain a control variate we thus need to approximate

. If one used classical approximation techniques to the PDE, such as finite difference or finite element methods, one would run into the curse of the dimensionality - the very reason one employs Monte Carlo simulations in the first place. Artificial neural networks have been shown to break the curse of dimensionality in specific situations

. However there is no known method that can guarantee the convergence to the optimal artificial neural network approximation. The application of the deep-network approximation to the solution of the PDE as a martingale control variate is an ideal compromise.

Even if we had exact solution to the PDE (6) we would need to discretise the integrals that arise in to obtain an implementable algorithm. To that end take a partition of denoted

 π:={t=t0<⋯

and consider an approximation of (5) by . For simplicity we approximate all integrals arising by Riemann sums always taking the left-hand point when approximating the value of the integrand. Of course, more sophisticated quadrature rules are also available.

If there is no exact solution to the PDE (6), as would be the case in any reasonable application, then we will approximate by for . The implementable control variate Monte-Carlo estimator is then the form

 Vπ,θ,λ,Nt,T:=1NN∑i=1{(Dπ(t,T))ig((Xπ)iT)−λNsteps−1∑k=1(Dπ(t,tk))i(Rθ)(tk,Xitk)σ(tk,(Xπtk)i)(Witk+1−Witk)}, (10)

where and is a free parameter to be chosen (because we discretise and use approximation to the PDE it is expected ). Again we point out that the above estimator is unbiased independently of the choice . We will discuss possible approximation strategies for approximating with in the following section.

In this section we propose 4 algorithms that attempt the learn PDE solution (or its gradient) and then use it to build control variate.

### 4.2. Direct approximation of v by an artificial neural network

A first, and perhaps the most natural, idea to set up learning algorithm for the solution of the PDE (6) would be to use PDE (6) itself as score function. Let so that is an approximation of . One could then set a learning task as

 θ∗=argminθ∥∂t(Rθ)+tr(a∂2x(Rθ))+b∂x(Rθ)−c(Rθ)∥[0,T]×D+∥(Rθ)(T,⋅)−g∥D

in some appropriate norms and . Different choices of approximations of the derivatives and the norms would result in variants of the algorithm. Smoothness properties of would be important for the stability of the algorithm. The key challenge with the above idea is that it is not clear what the training data to learn should be (the domain is typically unbounded). For this reason we do not study this algorithm here and refer reader to for numerical experiments .

Before we proceed further we recall a well known property of conditional expectations, for proof see e.g. [27, Ch.3 Th. 14].

###### Theorem 4.1.

Let . Let be a sub

-algebra. There exists a random variable

such that

 E[|X−Y|2]=infη∈L2(G)E[|X−η|2].

The minimiser, , is unique and is given by .

The theorem tell us that conditional expectation is an orthogonal projection of a random variable onto .

### 4.3. Probabilistic representation based on Backward SDE

Instead of working directly with (6) we work with its probabilistic representation (7) and view it as a BSDE. To formulate the learning task based on this we recall the time-grid given by (9) so that we can write it recursively as

 v(tNsteps,XNsteps)=D(t,tNsteps)g(XNsteps),D(t,tm+1)v(tm+1,Xtm+1)=D(t,tm)v(tm,Xtm)+∫tm+1tmD(t,s)∂xv(s,Xs)σ(s,Xs)dWsform=0,1,…,Nsteps−1.

Next consider a deep network approximation

 (Rη)(tm,x)≈v(tm,x),m=0,1,…,N,x∈Rd

and

 (Rθ)(tm,x)≈∂xv(tm,x),m=0,1,…,(N−1),x∈Rd.

Approximation depends on weights , . We set the learning task as

 (11)

Note that in practice one would also work with and moreover any minimisation algorithm can only be expected to find which only approximate the optimal . The complete learning method is stated as Algorithm 1.

We remark that the idea of approximating PDEs formulated as optimisation problem and using probabilistic representation already appeared in  and was recently combined with neural network approximations in series of works [38, 21, 22]. The learning problem stated by these authors is somewhat similar to (11). However, in [38, 21, 22], the value function is only approximated at a single point whereas we obtain approximation of and for any .

### 4.4. Feynman-Kac and automatic differentiation.

Automatic differentiation can be used to provide an variant of the method of Section 4.3. Instead of using a as an approximation of we can use automatic differentiation to applied to so that . The learning algorithm is then identical to Algorithm 1 but with replaced with . Recently very similar ideas have been explored in .

### 4.5. Bismut-Elworthy-Li Formula

Consider the solution to (5) with , that is

 Xxis=xi+∫stbi(r,Xxr)dr+d∑j=1∫stσij(r,Xxr)dWjri=1,…,d. (12)

Define , for . Let be the matrix , (i.e. the Jacobian matrix) and let be the matrix , (i.e. the Jacobian of the map with fixed). It can be shown, see [28, Ch. 4], that the matrix valued process satisfies

 dYs=∂xb(s,Xs)Ysds+d∑j=1∂xσ⋅j(s,Xs)YsdWjs,Yt=I, (13)

where

is identity matrix. Bismut-Elworthy-Li formula that we state next, provides probabilistic representation for gradient of the solution to the PDE (

6). This is in the same vein as Feynman-Kac formula provides probabilistic representation to the solution of (6).

###### Theorem 4.2 (Bismut-Elworthy-Li formula).

Fix . Let be continuous deterministic function such that . Then

 ∂xv(t,x)=E[D(t,T)g(XT)∫Tta(s)(σ(X(s)))−1Y(s)dWs∣∣∣Xt=x].

We refer reader to [15, Th. 2.1] or [13, Th. 2.1] for the proof. Let us point out that in the above representation no derivative of is needed. This makes it appealing in financial applications for calculating greeks in particular . In the case that is sufficiently well behaved (so that the stochastic integral is a true martingale) we have . Thus one can see that

 ∂xv(t,x)=E[(D(t,T)g(XT)−g(Xt))∫Tta(s)(σ(X(s)))−1Y(s)dWs∣∣∣Xt=x].

The resulting Monte Carlo estimator may enjoy reduced variance property, . To build a learning algorithm we use this theorem with and define

 X:=(D(t,T)g(XT)−D(t,t)g(Xt))1T−t∫Tt(σ(X(s)))−1Y(s)dWs

so that . Hence, by Theorem 4.1,

 E[|X−∂xv(t,Xt)|2]=infη∈L2(σ(Xt))E[|X−η|2]

and we know that for a fixed the random variable which minimises the mean square error is a function of . But by the Doob–Dynkin Lemma [10, Th. 1.3.12] we know that every can be expressed as for some appropriate measurable . For the practical algorithm we restrict the search for the function to the class that can be expressed as deep neural networks . Hence we consider a family of functions with and set learning task as

 (14)

In practice, one employs a variant of stochastic gradient algorithm, see for classical exposition on the topic [29, 7] and for more recent development . We use the the following notation

 θ⋄:=ˆargminθE[|X−(Rθt)(Xt)|2],

where indicates that an approximation is used to minimise the function. Algorithm 2 describes the method including time discretization.

## 5. Learning martingale control variate functional

Let us go back to the general setting of Section 2. In particular this means that the contingent claim that we wish to know the price of at time is given by and so is possibly path-dependent.

In this situation we cannot express the process arising from Martingale representation theorem, see (2), in terms of a derivative of the associated PDE on . Nevertheless we can set up a learning task that tries to approximate this with a deep network. To that end we recall the time grid given by (9), the associated discretisation of the process which is and the discrete discount factors . We then propose to approximate at time using a deep networks with inputs being the entire discrete path up to time i.e.

 Ztk≈Dπ(t,tk)(Rθtk)((Xπtj)kj=0)σ(tk,(Xπtk). (15)

As always represent the deep network weights that have to be chosen appropriately. The implementable control variate Monte-Carlo estimator now has the form

 Vπ,θ,λ,Nt,T:=1NN∑i=1Vπ,θ,λ,it,T,whereVπ,θ,λ,it,T:=(Dπ(t,T))iG(((Xπtk)Nstepsk=0)i)−λMi,θtk,TandMi,θtk,T:=Nsteps% −1∑k=1(Dπ(t,tk))i(Rθtk)(((Xπtj)kj=0)i)σ(tk,(Xπtk)i)ΔWitk+1. (16)

We now make several remarks:

1. We are not assuming that comes as a solution to an SDE here, it only needs to be a continuous adapted process (e.g. McKean–Vlasov SDE arising in local stochastic volatility models). However in practice one needs to be able to simulate this process to be able to set up the learning task.

2. It seems natural use (15) as an approximation to but in fact a direct approximation might perform equally well.

3. It has been advocated in [4, 5] that better computational result can be obtained by using discrete time version of martingale representation. Theorem 2.1 in  tell us that provided

 D(tj,T)G((Xπtk)Nstepsk=j)=E[D(tj,T)G((Xπtk)N%stepsk=j)]+Mti,T, (17)

where is discrete-time square integrable martingale. In  a representation for is given using an (infinite) series of Hermite Polynomial. Such results in literature are known as discrete-time analogue of Clark-Ocone formula [35, 1].

The representation (17) provides another route to deriving (16): discretise the time first and then apply the discrete martingale representation as opposed to first applying the martingale representation theorem and then discretising time. Indeed we effectively have that

 Mi,θt,T=Nsteps−1∑k=1(Dπ(t,tk))i(Rθtk)(((Xπtj)kj=0)i)σ(tk,(Xπtk)i)ΔWitk+1.

There are other possibilities for the choice of the form of . One could work with discrete time analogue of Clark-Ocone formula as already mentioned [4, 35, 1]. Alternatively one could build control variates using so called Stein operators as in [33, 6]. We leave it for the future research to explore this other possibilities.

4. This methods is effectively learning the hedging hedging strategy, see also .

5. Even though we said that there is no representation as a solution to some PDE on , there is a representation in terms of a path-dependent PDE, see .

6. We write that the network approximation depends on the entire path of up to . For practical learning one would typically use increments as inputs. There is also evidence that using iterated integrals as learning inputs is very efficient .

We will now formulate the the search for the control variate as learning tasks.

### 5.1. Empirical risk minimisation

Recall definition of given by (16). From (4) we know that the theoretical control variate Monte-Carlo estimator has zero variance and so it is natural to set-up a learning task which aims to learn the network weights in a way which minimises said variance:

 θ⋆,var:=argminθVar% [Vπ,θ,λ,Nstepst,T].

Setting , the learning task is stated as Algorithm 3.

Note that in this case we learn the control variate by setting . In the next method we show that in fact there is a way learn control variate with optimal by directly estimating it.

### 5.2. Empirical correlation maximisation

This method is based on the idea that since we are looking for a good control variate we should directly train the network to maximise the variance reduction between the vanilla Monte Carlo estimator and the control variates Monte Carlo estimator by also trying to optimise .

Recall that . The optimal coefficient that minimises the variance is

 λ∗,θ=Cov[ΞT,Mθt,T]Var[Mθt,T].

Let denote the Pearson correlation coefficient between and i.e.

 ρΞT,Mθt,T=Cov(ΞT,Mθt,T)√Var[ΞT]Var[Mθt,T].

With the optimal