Control variate is one of the most powerful variance reduction techniques for Monte Carlo simulation. While a good control variate can reduce the computational cost of Monte Carlo computation by several orders of magnitude, it relies on judiciously chosen control variate functions that are problem specific. For example, when computing price of basket options a sound strategy is to choose control variates to be call options written on each of the stocks in the basket, since in many models these are priced by closed-form formulae. In this article, we are interested in black-box-type control variate approach by leveraging the Martingale Representation Theorem and artificial neural networks. The idea of using Martingale Representation to obtain control variates goes back at least to. It has been further studied in combination with regression in  and .
This article uses the expressive power of deep neural networks, see [12, 23] by setting up the search for the martingale control variate as a learning task. Using the deep neural network as a control variate introduces no bias in the Monte Carlo computation. Importantly, this holds irrespective of quality of the deep learning approximation of the control variate functional. We believe that this is important as the current theory of deep learning functions approximations lacks theoretical foundation. More precisely:
Non-asymptotic analysis of deep neural network function approximation is in its infancy. Error estimates in terms of a number of layers and number of neurons in each layer are only known in very specific cases, see Grohs et al.. In general, such results are extremely difficult to obtain, see [20, Chapter 16]. The results in  show that there is a good approximation to the Black–Scholes PDE solution with number of parameters in the artificial neural network growing only polynomially with dimension and accuracy. However, there is no method of finding the appropriate parameters for the desired accuracy. The search for the optimal parameters is a non-convex optimisation problem, usually tackled using gradient descent-type algorithm.
There is little theoretical underpinning for the use of stochastic gradient algorithms for minimisation of non-convex functions. Furthermore stochastic gradient algorithms, such as the ADAM method , that are commonly used for training, lack theoretical foundation even in the convex case.
What we propose in this work is a method for harnessing the power of deep learning algorithms in a way that is robust even in the case some deep learning algorithm will not deliver the expected quality.
We developed several learning algorithms for finding martingale control variate functionals. These work both in the Markovian and non-Markovian setting. For the former, see Section 4 and Algorithms 1, 2. Algorithm 1 was inspired by Cvitanic et. al.  and Weinan et. al, Han et. al. [38, 21]. For the non-Markovian setting see Section 5 and Algorithms 3 and 4). Section 6.2 describes exactly the network architecture and implementation details. We empirically test these methods on relevant examples including a 100 dimensional option pricing problems, see Examples 6.2 and 6.4. We carefully measure the training cost and report the variance reduction achieved. See Section 6 for details. Since we work in situation where the function approximated by neural network can be obtained via other methods (Monte-Carlo, PDE solution) we are able to test how the expressive power of fully connected artificial neural networks depends on the number of layers and neurons per layer. See Section 6.5 for details.
We would like to draw the reader’s attention to Figure 1. We see that the various algorithms work similarly well in this case (not taking training cost into account). We note that the variance reduction is close to the theoretical maximum which is restricted by time discretisation. Finally we see that the variance reduction is still significant even when the neural network was trained with different model parameter (in our case volatility in the option pricing example).
From the results in this article we conclude that the artificial neural networks can be used to provide efficient control variates. However, we observed that all the algorithms are sensitive to the network architecture, parameters and distribution of training data. A fair amount of tuning is required to obtain good results. Based on this we believe that there is great potential in combining artificial neural networks with already developed and well understood probabilistic computational methods.
2. Martingale control variate
be a probability space and consider an-valued Wiener process . We will use to denote the filtration generated by . Consider an -valued, continuous, stochastic process that is adapted to . We will use to denote the filtration generated by .
Let be a measurable function. We shall consider contingent claims of the form . This means that we can consider path-dependent derivatives. Finally, we assume that there are (stochastic) discount factor given by for an appropriate function and let
We now interpret as some risk-neutral measure and so the -price of our contingent claim is
Say we have iid r.v.s with the same distribution as . Then the standard Monte-Carlo estimator is
Convergence in probability as
where and is such that with, but this also increases the computational cost. A better strategy is to reduce variance by finding an alternative Monte Carlo estimator, say , such that
and the cost of computing is similar to .
Martingale Representation Theorem, see e.g. [10, Th. 14.5.1], provides a generic (in a sense that it does not rely on a specific model) strategy for finding a Monte Carlo estimator with the above stated properties. Recall that by assumption is measurable and . Hence there exists a unique process adapted to with such that
Observe that in our setup, , for . Hence tower property of the conditional expectation implies that
We then observe that
If we can generate iid and with the same distributions as and respectively then we can consider the following Monte-Carlo estimator:
The estimator has the following properties:
Of course this on its own is of little practical use as, in general, there is no method to obtain the unique process .
In the remainder of the article we will devise and test several strategies, based on deep learning, to find a suitable approximation for the process by , , i.e one network for each . We will only require that are measurable and square integrable i.e. . The crucial feature of our approach is that
still has the property that , albeit the resulting variance typically will not be zero anymore. Note that is a parameter that can be chosen to reduce variance.
3. Artificial neural networks
We fix a locally Lipschitz function and for define as the function given, for by . We fix (the number of layers), , (the size of input to layer ) and (the size of the network output). A fully connected artificial neural network is then given by , where, for , we have real matrices and real
The artificial neural network defines a function given recursively, for , by
We can also define the function which counts the number of parameters as
We will call such class of fully connected artificial neural networks
. Note that since the activation functions and architecture are fixed the learning task entails finding the optimal.
4. Learning the PDE solution
Under some simplifying assumptions there is a representation for the unique given by the martingale representation theorem in terms of solution of an associated PDE. We explore this connection here and then propose several learning algorithms that are designed to learn the solution of the associated PDE.
We will assume that is a Markov process that is given as the solution to
and that there is a function such that so that
4.1. PDE derivation of the control variate
It can be shown that under suitable assumptions on , , and that . See e.g. . Let . Then, from Feynman–Kac formula, we get
Since and since satisfies the above PDE, if we apply Itô’s formula then we obtain
Hence Feynman-Kac representation together with the fact that yields
This shows that we have an explicit form of the process in (2), provided that
Thus we can consider the Monte-Carlo estimator.
To obtain a control variate we thus need to approximate
. If one used classical approximation techniques to the PDE, such as finite difference or finite element methods, one would run into the curse of the dimensionality - the very reason one employs Monte Carlo simulations in the first place. Artificial neural networks have been shown to break the curse of dimensionality in specific situations. However there is no known method that can guarantee the convergence to the optimal artificial neural network approximation. The application of the deep-network approximation to the solution of the PDE as a martingale control variate is an ideal compromise.
Even if we had exact solution to the PDE (6) we would need to discretise the integrals that arise in to obtain an implementable algorithm. To that end take a partition of denoted
and consider an approximation of (5) by . For simplicity we approximate all integrals arising by Riemann sums always taking the left-hand point when approximating the value of the integrand. Of course, more sophisticated quadrature rules are also available.
If there is no exact solution to the PDE (6), as would be the case in any reasonable application, then we will approximate by for . The implementable control variate Monte-Carlo estimator is then the form
where and is a free parameter to be chosen (because we discretise and use approximation to the PDE it is expected ). Again we point out that the above estimator is unbiased independently of the choice . We will discuss possible approximation strategies for approximating with in the following section.
In this section we propose 4 algorithms that attempt the learn PDE solution (or its gradient) and then use it to build control variate.
4.2. Direct approximation of by an artificial neural network
A first, and perhaps the most natural, idea to set up learning algorithm for the solution of the PDE (6) would be to use PDE (6) itself as score function. Let so that is an approximation of . One could then set a learning task as
in some appropriate norms and . Different choices of approximations of the derivatives and the norms would result in variants of the algorithm. Smoothness properties of would be important for the stability of the algorithm. The key challenge with the above idea is that it is not clear what the training data to learn should be (the domain is typically unbounded). For this reason we do not study this algorithm here and refer reader to for numerical experiments .
Before we proceed further we recall a well known property of conditional expectations, for proof see e.g. [27, Ch.3 Th. 14].
Let . Let be a sub -algebra. There exists a random variable
-algebra. There exists a random variablesuch that
The minimiser, , is unique and is given by .
The theorem tell us that conditional expectation is an orthogonal projection of a random variable onto .
4.3. Probabilistic representation based on Backward SDE
Instead of working directly with (6) we work with its probabilistic representation (7) and view it as a BSDE. To formulate the learning task based on this we recall the time-grid given by (9) so that we can write it recursively as
Next consider a deep network approximation
Approximation depends on weights , . We set the learning task as
Note that in practice one would also work with and moreover any minimisation algorithm can only be expected to find which only approximate the optimal . The complete learning method is stated as Algorithm 1.
We remark that the idea of approximating PDEs formulated as optimisation problem and using probabilistic representation already appeared in  and was recently combined with neural network approximations in series of works [38, 21, 22]. The learning problem stated by these authors is somewhat similar to (11). However, in [38, 21, 22], the value function is only approximated at a single point whereas we obtain approximation of and for any .
4.4. Feynman-Kac and automatic differentiation.
Automatic differentiation can be used to provide an variant of the method of Section 4.3. Instead of using a as an approximation of we can use automatic differentiation to applied to so that . The learning algorithm is then identical to Algorithm 1 but with replaced with . Recently very similar ideas have been explored in .
4.5. Bismut-Elworthy-Li Formula
Consider the solution to (5) with , that is
Define , for . Let be the matrix , (i.e. the Jacobian matrix) and let be the matrix , (i.e. the Jacobian of the map with fixed). It can be shown, see [28, Ch. 4], that the matrix valued process satisfies
is identity matrix. Bismut-Elworthy-Li formula that we state next, provides probabilistic representation for gradient of the solution to the PDE (6). This is in the same vein as Feynman-Kac formula provides probabilistic representation to the solution of (6).
Theorem 4.2 (Bismut-Elworthy-Li formula).
Fix . Let be continuous deterministic function such that . Then
We refer reader to [15, Th. 2.1] or [13, Th. 2.1] for the proof. Let us point out that in the above representation no derivative of is needed. This makes it appealing in financial applications for calculating greeks in particular . In the case that is sufficiently well behaved (so that the stochastic integral is a true martingale) we have . Thus one can see that
The resulting Monte Carlo estimator may enjoy reduced variance property, . To build a learning algorithm we use this theorem with and define
so that . Hence, by Theorem 4.1,
and we know that for a fixed the random variable which minimises the mean square error is a function of . But by the Doob–Dynkin Lemma [10, Th. 1.3.12] we know that every can be expressed as for some appropriate measurable . For the practical algorithm we restrict the search for the function to the class that can be expressed as deep neural networks . Hence we consider a family of functions with and set learning task as
where indicates that an approximation is used to minimise the function. Algorithm 2 describes the method including time discretization.
5. Learning martingale control variate functional
Let us go back to the general setting of Section 2. In particular this means that the contingent claim that we wish to know the price of at time is given by and so is possibly path-dependent.
In this situation we cannot express the process arising from Martingale representation theorem, see (2), in terms of a derivative of the associated PDE on . Nevertheless we can set up a learning task that tries to approximate this with a deep network. To that end we recall the time grid given by (9), the associated discretisation of the process which is and the discrete discount factors . We then propose to approximate at time using a deep networks with inputs being the entire discrete path up to time i.e.
As always represent the deep network weights that have to be chosen appropriately. The implementable control variate Monte-Carlo estimator now has the form
We now make several remarks:
We are not assuming that comes as a solution to an SDE here, it only needs to be a continuous adapted process (e.g. McKean–Vlasov SDE arising in local stochastic volatility models). However in practice one needs to be able to simulate this process to be able to set up the learning task.
It seems natural use (15) as an approximation to but in fact a direct approximation might perform equally well.
where is discrete-time square integrable martingale. In  a representation for is given using an (infinite) series of Hermite Polynomial. Such results in literature are known as discrete-time analogue of Clark-Ocone formula [35, 1].
The representation (17) provides another route to deriving (16): discretise the time first and then apply the discrete martingale representation as opposed to first applying the martingale representation theorem and then discretising time. Indeed we effectively have that
There are other possibilities for the choice of the form of . One could work with discrete time analogue of Clark-Ocone formula as already mentioned [4, 35, 1]. Alternatively one could build control variates using so called Stein operators as in [33, 6]. We leave it for the future research to explore this other possibilities.
This methods is effectively learning the hedging hedging strategy, see also .
Even though we said that there is no representation as a solution to some PDE on , there is a representation in terms of a path-dependent PDE, see .
We write that the network approximation depends on the entire path of up to . For practical learning one would typically use increments as inputs. There is also evidence that using iterated integrals as learning inputs is very efficient .
We will now formulate the the search for the control variate as learning tasks.
5.1. Empirical risk minimisation
Recall definition of given by (16). From (4) we know that the theoretical control variate Monte-Carlo estimator has zero variance and so it is natural to set-up a learning task which aims to learn the network weights in a way which minimises said variance:
Setting , the learning task is stated as Algorithm 3.
Note that in this case we learn the control variate by setting . In the next method we show that in fact there is a way learn control variate with optimal by directly estimating it.
5.2. Empirical correlation maximisation
This method is based on the idea that since we are looking for a good control variate we should directly train the network to maximise the variance reduction between the vanilla Monte Carlo estimator and the control variates Monte Carlo estimator by also trying to optimise .
Recall that . The optimal coefficient that minimises the variance is
Let denote the Pearson correlation coefficient between and i.e.
With the optimal