Arbitrage-Free Implied Volatility Surface Generation with Variational Autoencoders

by   Brian Ning, et al.

We propose a hybrid method for generating arbitrage-free implied volatility (IV) surfaces consistent with historical data by combining model-free Variational Autoencoders (VAEs) with continuous time stochastic differential equation (SDE) driven models. We focus on two classes of SDE models: regime switching models and Lévy additive processes. By projecting historical surfaces onto the space of SDE model parameters, we obtain a distribution on the parameter subspace faithful to the data on which we then train a VAE. Arbitrage-free IV surfaces are then generated by sampling from the posterior distribution on the latent space, decoding to obtain SDE model parameters, and finally mapping those parameters to IV surfaces.



page 1

page 2

page 3

page 4


Bayesian Approach for Parameter Estimation of Continuous-Time Stochastic Volatility Models using Fourier Transform Methods

We propose a two stage procedure for the estimation of the parameters of...

Forecasting security's volatility using low-frequency historical data, high-frequency historical data and option-implied volatility

Low-frequency historical data, high-frequency historical data and option...

Discrete Auto-regressive Variational Attention Models for Text Modeling

Variational autoencoders (VAEs) have been widely applied for text modeli...

A threshold model for local volatility: evidence of leverage and mean reversion effects on historical data

In financial markets, low prices are generally associated with high vola...

Parameter Learning and Change Detection Using a Particle Filter With Accelerated Adaptation

This paper presents the construction of a particle filter, which incorpo...

Interpolation of Missing Swaption Volatility Data using Gibbs Sampling on Variational Autoencoders

Albeit of crucial interest for both financial practitioners and research...

Fast Converging and Robust Optimal Path Selection in Continuous-time Markov-switching GARCH Model

We propose CROPS, a fast Converging and Robust Optimal Path Selection al...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Modelling implied volatility (IV) surfaces in a manner that reflects historical dynamics while remaining arbitrage-free is a challenging open problem in finance. There are numerous approaches driven by stochastic differential equations (SDEs) that aim to do just so, including local volatility models [dupire1994pricing], stochastic volatility models [heston1993closed, hagan2002managing], stochastic local volatility models [said1999pricing], jump-diffusion models [cont2002calibration], and regime switching models [buffington2002regime], among many others. Such approaches make specific assumptions on the dynamics of the underlying asset and a choice of an equivalent martingale measure in order to avoid arbitrage. While these assumptions are not necessarily dynamically consistent with historical data, they do allow, e.g., pricing exotic derivatives via Monte Carlo or PDE methods.

An alternative to the SDE approach is to use non-parametric models to approximate IV surfaces directly without making assumptions on the underlying dynamics. For example, ML models such as support vector machines (SVMs) have been used to model such surfaces

[zeng2019online]. The issue of ensuring arbitrage-free surfaces is often tackled jointly during model fitting [bergeron2021autoencoders] either through penalisation of arbitrage constraints [ackerer2020deep] or by directly encoding them into the network architecture [zheng2019gated]. These approaches, however, typically do not provide any guarantees and often are not arbitrage-free across the entire surface. A recent intriguing approach [cohen2021arbitrage] is to reduce surfaces to arbitrage-free ‘factors’ – learned, e.g., through PCA – which can then be modeled using neural SDEs [li2020scalable]. This approach, while very promising, relies on the quality of the ‘factors’ which are often complicated to compute.

In this paper, we develop a hybrid approach to resolve these issues by using SDE models that are by construction arbitrage-free yet flexible enough to fit arbitrary IV surfaces. One immediate dividend of this approach lies in its ability to produce realistic synthetic training data that can be used to leverage deep learning pricing methods in downstream tasks

[ferguson2018deeply, horvath2019deep]. The class of SDE models we consider include time-varying regime switching models and Lévy additive processes detailed in Section 3. We avoid overfitting by incorporating a Wasserstein penalty to keep the SDE model’s implied risk-neutral density from deviating too far from the empirical one. The SDE model parameters, once fitted to data, represent a parameter subspace reflecting the features embedded in the data. The distribution on the subspace depends on the characteristics of the underlying asset and can be complex. We “learn” this distribution by using Variational Autoencoders (VAEs) which also allows for disentanglement of the subspace in an interpretable manner. SDE model parameters may be generated from the VAE model and used to create IV surfaces that are both faithful to the historical data but also strictly risk-neutral. This is similar in spirit, but distinct from, the tangent Lévy model approach introduced in [carmona2012tangent] where a Lévy density is used to generate arbitrage-free prices, while here, the VAE generates parameters of the SDE model.

The overall approach may be summarised as: (i) fit a rich arbitrage-free SDE model to historical market data to obtain a collection of parameters, (ii) train a generative model, in particular a VAE model, on the collection of SDE model parameters, (iii) sample from the latent space – using a kernel density estimator (KDE) approach – of the generative VAE model, (iv) decode the samples to obtain a collection of SDE model parameters, and (v) use said SDE model and parameters to obtain arbitrage-free surfaces faithful to the historical data. A flow-chart of the process is presented in Figure


The remainder of this article is organised as follows. Section 2 describes a generic method of fitting SDE models to a limited data set. Section 3 defines the financial models we use in calibration. Section 4 details the structure of the VAE and its generative process. Finally, Section 5 presents the results of our algorithm applied to 1,900 days of foreign exchange (FX) data for three currency pairs111AUD = Australian Dollar, USD = US Dollar, and CAD = Canadian Dollar: AUD-USD, EUR-USD, and CAD-USD.

IV Surface Data

Fit SDE Models

Train VAE

Encode SDE Parameters


Latent Variables

Train KDE

Decode Samples



Sample IV Surfaces

Sampled Parameters

SDE Model Parameters
Figure 1: Flow chart of algorithm starting from raw IV surface data to generated surfaces.

2 Model Setup and Estimation Procedure

We work with a completed filtered probability space

where the filtration is the natural one generated by a stochastic driver . We explore several choices of models for in Section 3. Here, represents the risk-neutral probability measure and we assume that the market prices options using this measure and model the FX rate process as follows:


where and are the domestic and foreign short rate processes, and is a deterministic function of time that ensures is a -martingale. We assume interest rates are deterministic since there is no conceptual difficulty in generalising to the stochastic case.

From, e.g., Theorem 3.2 in [lewis2001simple], we may write the undiscounted option price as


where ,

is the characteristic function of

, encodes the parameters of the stochastic driver , and denotes the real component of its argument. For the class of SDE models considered here, the characteristic function is known in closed form and the above formula allows for efficient calibration to market data.

A naive approach to parameter estimation is to minimise the squared error between the model and data prices. Such parameter estimation is prone to overfitting when data is sparse as is often the case in FX markets. To address this issue, we add a regularisation term to our objective. Specifically, we use the

-Wasserstein distance between the model price’s risk-neutral probability distribution function (pdf), denoted

, and the implied pdf derived from option data, denoted . Wasserstein distances provide a natural metric on the space of probability measures and have seen wide application across many fields [villani2009optimal]. Other choices include divergences, such as the Kullback-Liebler divergence, or metric variations such as Jensen-Shannon divergence. We chose the Wasserstein distance in particular because it is the most ubiquitous and robust.

To this end, the 1-Wasserstein distance between and is given by


where denotes the set of all probability distributions on with marginals and and the second equality holds in dimension one [vallender1974calculation]

. As data is observed at discrete strikes, we approximate the empirical density by interpolating IVs at each fixed maturity using B-splines. It is well known

[breeden1978prices] that corresponds to the risk-neutral density of the underlying asset price evaluated at . Since we are using a B-spline, the corresponding density is not necessarily risk-neutral. This is not a problem, however, as we merely use them as a regularisation term and ultimately use risk-neutral models to derive option prices.

We use a combination of the pricing error and the Wasserstein distance (3

) as the loss function in model estimation. That is we seek to obtain model parameters


where and

denotes the flattened vector of data and model prices, respectively, at each strike-maturity pair, and

and represent the model and data implied distribution functions at each maturity, respectively. The regularisation parameter controls the importance placed on being close to the spline implied densities.

3 Class of Stochastic Drivers

In this section, we describe the class of models over which we perform estimation. Throughout, we denote the sequence of dates on which we have option implied volatility data by and define .

3.1 Ctmc

Figure 2:

Typical fits of the CTMC model to the IVs (top) and densities (bottom) of AUD-USD data with root mean squared error in the 50th (blue) and 90th (red) quantiles.

The continuous time Markov-Chain model (CTMC) is a multi-regime model that assumes the underlying asset follows a Geometric Brownian motion (GBM) modulated by a continuous time Markov-Chain representing the current market regime. They were first introduced into financial modelling in

[buffington2002regime]. Here, however, we generalise the model to allow for time-varying parameters and use transform methods in [jackson2008fourier] to solve for the characteristic function. We use this model as a non-parametric approach to modelling the sequence of risk-neutral densities. The potential of overfitting of such models is mitigated by the Wasserstein distance penalty in (3).

Let denote a continuous time Markov chain taking on values in . Suppose, moreover, that the generator matrix driving the CTMC, the regime specific vector of drifts , and the regime specific vector of volatilities are all constant on the sequence of maturity intervals , but may vary across maturity periods. We can then consider a driving process satisfying the SDE


where and denote the expected return and instantaneous volatility in period when , . Similarly, we let denote the transition rate matrix for period , satisfying , , and .

Proposition 1

If satisfies (5), then the characteristic function is given by where

is the prior probability of the latent state,


and is the Kroencker delta, which equals if and otherwise.


See Appendix

To assist with identifiability when estimating the CTMC model parameters, we introduce a cyclic structure on the transition rate matrices. More precisely, we require that , for all , , , and , together with the usual constraint that , for all .

Figure 2 shows the CTMC model fitted to two days of data including the corresponding implied densities.

Model Characteristic Function
Double Exponential JD
Gaussian Mixture JD
Table 1: Summary of characteristic functions within a given maturity period.  

3.2 Lévy Additive Processes

For comparison, we also study a class of Lévy additive processes to allow jumps in FX rates. To this end, we model as


where is piecewise deterministic, , is a Poisson random measure with compensator , and where with being Lévy measures. This allows the structure of the Lévy measure to differ between maturity periods and allows for both finite and infinite activity processes. For example, we may have , in which case is an additive compound Poisson process with jump measure and intensity , or in which case is additive version of a tempered stable Lévy measure (also known as the KoBol [boyarchenko2000option, boyarchenko2002non] or CGMY [carr2002fine] models). There are a slew of alternate models as well, however, in the sake of brevity we restrict to these two classes. For the additive compound Poisson process, we include two jump measure types: mixture of normals , where is the standard normal pdf – to mimick a non-parametric estimation of the jump distribution implied by the data, and a double exponential model (see [kou2004option]), in which case .

Proposition 2

If satisfies (7), then the characteristic function is given by , where



Apply the Lévy-Khintchine formula [applebaum2009levy, Chap 1.2.4] within each period.

The specific form of the Lévy characteristic function appearing in (8) for the models we employ in the numerical analysis appear in Table 1. The parameters for the appropriate period should be inserted into these expression when computing the full characteristic function. We also record the characteristic function for the CTMC model in the same table.

4 Model Parameter Generation

In the previous section, we described two classes of stochastic models that can be calibrated to option prices. The SDE model parameters are estimated by minimising a weighted average of the mean-squared error in option prices and the Wasserstein distance between the model implied risk-neutral densities and the densities implied by a -spline fit of the IV smiles. Once the SDE model parameters are estimated on market data, our goal is to generate new synthetic parameters that are consistent with the historical data. This allows us to produce synthetic IV surfaces that are guaranteed to be both arbitrage-free and representative of real surfaces.

4.1 Variational Autoencoder (VAE)

Variational Autoencoders [kingma2013auto] are generative models that aim to train a multivariate latent representation, known as an encoding, from a collection of data. Given a set of samples from a distribution parameterized by ground truth latent factors where and a generative model , we seek to maximise the log-likelihood . The log-likelihood is, however, intractable as it involves integration over the posterior , which, even in setups where

is specified, is itself intractable. The VAE circumvents this issue by introducing an approximation of the true posterior with a neural network

parameterized by . Indeed, for any distribution , we have that the log-likelihood satisfies the inequality equationparentequation

The right most expression of this inequality is known as the evidence lower bound (ELBO) of the log-likelihood. It can be shown that the precise gap between the left and right hand sides of this inequality equals the KLD (KL-Divergence) between and , i.e.,


VAEs then view the negative ELBO as a loss and, rather than maximising the intractable log-likelihood, aim to minimize




Figure 3: VAE architecture.

where the first term is known as the KLD loss and the second term the reconstruction loss. In principle, the prior, posterior, and generator can be arbitrary distributions. In practice, however, this typically leads to an intractable ELBO. Thus, we assume they are from the family of Gaussian distributions with diagonal covariance matrices, and that the prior is the isotropic unit Gaussian. There is no real loss of generality in light of the universal approximation theorem. Specifically, we assume

, , and , where and are diagonal matrices.

The neural network that parametrises is called the encoder as it “encodes” the data into its latent representation, while the neural net that parameterises is called the decoder as it “decodes” latent representation to recover the original data. Figure 3 shows the typical structure of a VAE. We train the encoding and decoding networks simultaneously but only use the decoder at inference time.

4.2 -Vae

The -VAE [higgins2016beta]

is a modification of the traditional VAE objective that introduces an adjustable hyperparameter

, precisely the ELBO is modified to


Larger values of result in more disentangled latent representations , while smaller values result in more faithful reconstructions. is often chosen to be greater than one to encourage disentanglement; however, this constrains latent information and can lead to poorer reconstructions [higgins2016beta]. Instead, we may choose smaller values of to improve reconstructions and rely on a KDE to accurately sample from the less structured latent space.

4.3 Latent Sampling

Typically, samples from a VAE are generated by sampling from the latent prior and decoding the result. From a Bayesian perspective, however, conditional on a set of observed data , it is more appropriate to sample from the posterior . While this integral is typically intractable, it is possible to sample from. For this purpose, we use a multivariate Gaussian KDE with bandwidth determined by Scott’s Rule [scott2015multivariate] to provide a non-parametric approximation of it.

Returning to our specific IV context, let , , be the set of historical inputs to the VAE, i.e., the SDE model parameters obtained by fitting our historical data. The variational approximation

provides us with the posterior mean and standard deviations

of the latent factors. For each , we obtain samples from the corresponding posterior to obtain a collection of samples of the latent space, and use these points to approximate via a KDE.

5 Results

5.1 Market Data

We apply our hybrid method to IV data for three currency pairs provided by Exchange Data International: AUD-USD, EUR-USD, and CAD-USD for the 1,900 days between September 18th, 2012 to December 30, 2019 split equally into training and testing set in a random manner. The data includes option prices with five strikes at each of eight different maturities (1M, 2M, 3M, 6M, 9M, 1Y, 3Y, 5Y). Foreign exchange option prices are quoted in terms of deltas rather than strikes and in terms of at-the-money call, risk reversal, and butterfly spread options, rather than simple calls. We use standard formulas [reiswich2012fx] to convert raw quotes into IVs at deltas of 0.1, 0.25, 0.5, 0.75, and 0.9 for each maturity.

5.2 SDE Model Specifics

CTMC 8.1 5.0 6.0
DE JD 62.0 42.9 64.2
GM JD 10.2 17.0 24.3
CGMY/KoBoL 140.2 151.3 136.2
Table 2: Median rmse () for the collection of models and FX pairs.

The CTMC model assumes three regimes with a Wasserstein’s distance penalty222The Wasserstein penalty for the various models are chosen from a grid search and balancing goodness of fit to IV smiles with goodness of fit to the implied risk-neutral densities. of . To reduce the number of parameters to fit, initial regime probabilities are set to . To take advantage of the label invariance of the transition matrix, the mean of each regime is assumed to be in ascending order, i.e., . For the Lévy additive processes, we fit the parameters subject to a Wasserstein’s distance penalty of for time to maturity (TTM) less than 1 year and otherwise to increase the regularising power at larger TTM. Moreover, we apply a penalty of on day-to-day parameter percentage changes to stabilise model parameters without sacrificing fit quality. For the Gaussian Mixture JD model, we assume a mixture of two Gaussians, as adding more factors did not increase the fit quality. The penalty varies with maturity as long maturities tend to have stable pdfs but less stable IVs, while shorter maturities tend to have less stable pdfs. Table 2 show the median root-mean squared error (rmse) across days, where the rmse on a given day is computed across all Delta/maturity pairs, for the collection of models and FX pairs we study.

5.3 VAE Model Specifics

The encoder and decoder of the VAE have four fully connected hidden layers, with 64, 128, 256, and 512 nodes each, and a single output layer mapping to the appropriate dimensions. We perform a grid search over values and number of latent dimensions summarized in Table 3

. The table reports an evaluation metric described in the next subsection. Training is carried out with batches of

randomly sampled days from the training set. We set a fixed training duration of epochs as we find that is usually sufficient to train the -VAE.

5.4 Evaluation Metric

Latent Dimension
Model 3 5 10 15 3 5 10 15 3 5 10 15
CTMC 0.01 0.58 0.84 1.43 1.53 0.34 0.56 0.91 1.45 0.30 0.78 0.78 1.02
0.1 0.45 0.66 0.72 1.25 0.34 0.36 0.65 0.84 0.38 0.39 0.88 0.84
1 0.33 0.59 0.61 0.51 0.31 0.34 0.42 0.38 0.24 0.27 0.37 0.39
10 2.68 2.56 2.30 2.29 3.01 2.84 2.46 2.62 2.59 2.75 2.11 2.13
DE JD 0.01 1.26 0.93 1.20 0.84 0.21 0.25 0.25 0.46 0.16 0.17 0.17 0.20
0.1 1.06 1.49 0.84 1.10 0.29 0.24 0.46 0.24 0.14 0.20 0.26 0.21
1 1.02 0.94 0.97 0.88 0.74 0.46 0.42 0.42 0.28 0.18 0.19 0.20
10 1.12 1.68 1.36 1.39 1.48 1.08 1.67 1.34 0.61 0.53 0.57 0.78
GM JD 0.01 1.45 1.78 1.96 1.98 0.57 0.48 0.66 0.60 0.38 0.40 0.48 0.47
0.1 2.03 1.90 1.78 1.84 0.67 0.50 0.59 0.51 0.41 0.40 0.48 0.36
1 1.78 1.76 1.62 1.82 0.62 0.63 0.70 0.58 0.51 0.55 0.45 0.45
10 2.55 2.85 2.64 3.07 1.55 1.73 1.56 1.26 1.10 1.21 1.07 1.47
CGMY/KoBoL 0.01 1.56 1.30 1.43 1.23 0.54 0.61 0.57 0.66 0.46 0.47 0.40 0.45
0.1 1.34 1.17 1.40 1.21 0.62 0.57 0.59 0.61 0.44 0.44 0.48 0.42
1 1.93 1.21 1.33 1.44 0.66 0.67 0.74 0.62 0.45 0.45 0.44 0.46
10 2.35 2.30 2.25 2.11 1.60 1.79 1.67 1.92 0.95 0.82 0.94 0.66
Table 3: Wasserstein metrics () for varying levels of , latent dimensionality, and currency pair. Bold numbers are the smallest metric within each subgrid.  

Our goal is to generate arbitrage-free IV surfaces that are faithful to the historical dataset. Here, we describe a natural metric that allows us to assess how well we meet this goal. Let and denote the probability distribution over IV surfaces for the trained model using the overarching algorithm and true distribution, respectively. As Wasserstein distances provide a natural metric on the space of probability measures, we use the -Wasserstein distance between the trained and true distribution as our performance metric. While the true distribution is unknown, the data provides a finite sample from it at a set of discrete 2-dimensional grid points (the collection of Delta/TTM pairs which are observed). Further, the trained model’s distribution may be estimated by sampling from the posterior distribution in latent space (using the KDE method described in Section 4.3) and decoding to produce SDE model parameters, which can be mapped to a sample of IVs at the set of grid points . The -Wasserstein distance between the true and the model’s distribution may be estimated by the -Wasserstein distance between the multi-variate distribution of IVs at grid points for the test data and the model generated ones. We refer to this quantity as the Wasserstein metric.

5.5 Results Summary

Figure 4: Testing Parameters
Figure 5: Generated Parameters
Figure 6: Generated Samples
Figure 7: Scatter plots of the three regime specific and from (a) fitting to test data, and (b) random sample from the generative model, using the CTMC model on AUD-USD data. Colours represent three different regimes. Panel (c) shows four randomly generated smiles from the VAE – each colour corresponds to a sample.

Table 3 shows a complete summary of the Wasserstein metric computed for each currency pair on a range of values and latent dimensions. The best metric within each grid search is set in bold. For the CTMC model, we find the best results when and with latent dimensions. Increasing the number of latent dimensions does not generally increase performance. This suggests that for these currency pairs, most surfaces can be captured with three factors. Interestingly, for both classes of SDE models, decreasing does not lead to significantly worse performance, while increasing proves to be severely detrimental. This is an indication that the posterior sampling from Section 4.3 performed admirably in highly unstructured latent spaces, a by-product of low values. For the three Lévy additive processes explored here, the double exponential JD performs the best for all three currency pairs but the optimal hyperparameters vary across currency pairs.

As a benchmark metric, we use a functional data analysis (FDA) approach for modelling the IV surfaces and compare the Wasserstein metric for this approach with our VAE results. For an overview of FDA, see [ram]. In this approach, we regress the IV surfaces from the training set directly onto Legendre Polynomials of degree 3 (in Delta & TTM), use a KDE of the distribution of coefficients, simulate IVs on the grid , and compute the Wasserstein metric for the three currency pairs. For AUD-USD, EUR-USD and CAD-USD, the Wasserstein metrics are , , and , respectively. Given the results from Table 3, this suggests that, with the best hyperparameters, the VAE is able to generate better faithful-to-history surfaces than those produced through FDA. It is important to note that we expect the FDA to perform well, as it is a direct fit of historical IV that is not hindered by forcing the absence of arbitrage. The VAE generated surfaces, however, are guaranteed to be arbitrage-free.

Figure 7

shows a comparison between (a) the CTMC parameters obtained by fitting to test data, and (b) random samples from the corresponding VAE model. As the figure shows, the VAE successfully captures the various complex structures that are inherent in the test data across all maturities and regimes. Furthermore, panel (c) of the figure shows four randomly generated smiles from the VAE model, and demonstrates that the VAE is able to capture realistic variation in the average volatility, convexity, and skewness of smiles.

6 Conclusions

To summarise, we propose a hybrid approach for generating synthetic IV surfaces by first calibrating SDE model parameters to historical data – using a Wasserstein penalty between the implied model risk-neutral distribution and that induced by option data as a regularisation term – and then training a rich VAE model to learn the distribution on the space of SDE model parameters. We show that the distribution of IV surfaces from the VAE model is very close to that from our out-of-sample test data, and performs well in comparison with an FDA model, while ensuring the generated surfaces are arbitrage-free.


7 Appendix

Proof (Proof of Proposition 1)

Denote . As is Markov, there exists a function , such that . Moreover, applying the Feynman-Kac theorem, we have that the function satisfies the coupled system of PDEs


subject to the terminal condition (t.c.) , and denotes the infinitesimal generator, given , which acts upon twice differentiable functions as follows


To solve the coupled system of PDEs (13

), we apply the Fourier transform defined as

to both sides of (13). Recall that . Thus,


Using the above, and denoting , (13) may be written in Fourier space as


s.t. the t.c. , where and is the Dirac delta function. We may further rewrite this system of equations in matrix notation by (i) defining the matrix whose entries are where is the Kroencker delta, and (ii) defining the vector of transformed prices . Thus, (16) may be written as a vector-valued ODE


s.t. the t.c. . This system may be solved explicitly by backward induction. For (the last period), the matrix ODE admits the solution Next, due to continuity, we have that . Using this limit as the t.c. at , for , we solve , which admits the solution

Continuing iteratively, we obtain Continuing iteratively we arrive at


Averaging over the prior on , and taking the Fourier inverse (which is trivial due to the Dirac delta function), we obtain the stated result.