Computation of optimal transport and related hedging problems via penalization and neural networks

by   Stephan Eckstein, et al.
University of Konstanz

This paper presents a widely applicable approach to solving (multi-marginal, martingale) optimal transport and related problems via neural networks. The core idea is to penalize the optimization problem in its dual formulation and reduce it to a finite dimensional one which corresponds to optimizing a neural network with smooth objective function. We present numerical examples from optimal transport, martingale optimal transport, portfolio optimization under uncertainty and generative adversarial networks that showcase the generality and effectiveness of the approach.


page 1

page 2

page 3

page 4


Optimal transport mapping via input convex neural networks

In this paper, we present a novel and principled approach to learn the o...

MinMax Methods for Optimal Transport and Beyond: Regularization, Approximation and Numerics

We study MinMax solution methods for a general class of optimization pro...

Learning Cost Functions for Optimal Transport

Learning the cost function for optimal transport from observed transport...

Faster Unbalanced Optimal Transport: Translation invariant Sinkhorn and 1-D Frank-Wolfe

Unbalanced optimal transport (UOT) extends optimal transport (OT) to tak...

Feature Robust Optimal Transport for High-dimensional Data

Optimal transport is a machine learning technique with applications incl...

A scalable deep learning approach for solving high-dimensional dynamic optimal transport

The dynamic formulation of optimal transport has attracted growing inter...

Physics Informed Convex Artificial Neural Networks (PICANNs) for Optimal Transport based Density Estimation

Optimal Mass Transport (OMT) is a well studied problem with a variety of...

Code Repositories


Code for numerical section of "Computation of optimal transport and related hedging problems via penalization and neural networks"

view repo

1 Introduction

In this paper we present a penalization method which allows to compute a wide class of optimization problems of the form

by means of neural networks. The most widely known representative of such a functional occurs in the optimal transport problem, to be introduced shortly. More generally, these functionals appear for instance in the representation of coherent risk measures [4] as the worst-case expected loss over a class

of scenario probabilities, in the representation of nonlinear expectations

[41], or as the upper bound of arbitrage-free prices for a contingent claim , see e.g. [25]. To solve the initial problem we will make use of its dual formulation and restrict to the subclass of those optimization problems which can be realized as a minimal superhedging price

for some , where is a set of continuous and bounded functions , where the relation of and is given at the beginning of Section 2. A very similar class of optimization problems in an abstract framework of Banach lattices is studied in [21]. Under sufficient regularity conditions the values of the primal problem and its dual problem can be shown to coincide, see e.g. [16] for related pricing-hedging dualities.

A typical example is the Kantorovich relaxation [36] of Monge’s optimal transport problem, where is the set of probability measures on a product space with given marginals and , and where is the set of all continuous and bounded functions and . Further frequently studied problems in this class include multi-marginal optimal transport and Wasserstein distances (see e.g. [5, 50, 51]), martingale optimal transport (see e.g. [7, 26, 31, 33]), value at risk under dependence uncertainty (see e.g. [11, 22, 44]), or calculating worst case copula values and improved Fréchet-Hoeffding bounds (see e.g. [6, 40]). Moreover, serves as a building block for several other problems, like generative adversarial networks (where additionally, the optimization includes generating a distribution, see e.g. [3, 23, 30]), portfolio choice under dependence uncertainty (where additionally, portfolio weights are optimized, see e.g. [10, 43]), or robust optimized certainty equivalents (see e.g. [20]). In these cases, the solution approach presented in this paper is still applicable.

Summary of the approach

The goal is to solve numerically. The implementation will build on the dual representation of . The first step is to go over to a finite dimensional setting, where the set is replaced by a subset :

Theoretically, we will look at a sequence with such that is in a certain sense dense in . More concretely, can be a set of neural networks with a fixed structure (but unspecified parameter values), and

measures the number of neurons per layer.

To allow for a step-wise updating of the parameters (e.g. by gradient descent methods) for the space , the inequality constraint is penalized. To this end, we introduce a reference probability measure on the state space . Intuitively, this measure will be used to sample points at which the inequality constraint can be tested. Further, we introduce a differentiable and nondecreasing penalty function . This leads to the penalized problem

For theoretical considerations we also introduce

Theoretically, we will again consider sequences of penalty functions parametrized by a penalty factor , and use the notation and . Here, an increasing penalty factor can be seen as a more and more precise enforcing of the inequality constraint .

The problems are the ones which are solved numerically. Chapters 2 and 3 study the relation between this problem which is eventually implemented, and the initial problem . To this end, we analyse how the introduced approximative problems behave for and . Figure 1 summarizes the occurring problems and their relations. Notably, we are only interested in convergence of optimal values, not that of optimizers.

Figure 1: Occurring problems and their relations. The depicted convergences are studied in Section 2 and, in a more specific context of neural networks, in Section 3.

The final step is to find a numerical solution of , which means in practice finding the optimal parameters of the network

. We use Tensorflow

[1] and the Adam optimizer [38] to this end, and thus mostly regard this step as a black box. We will denote the numerical optimal solution by .

Implementation method: Related literature

Penalization of optimal transport problems has been studied in several works (see e.g. [9, 14, 17, 18, 27, 30, 45, 46, 48]). Entropic penalization in particular is applied often, which is in close relation to the Schrödinger problem [39]. Cominetti and San Martín’s work [17]

from 1994 on entropic penalization of arbitrary linear programs can be applied to purely discrete optimal transport. The basic idea in

[17] is to obtain a strictly convex problem through penalization which can be solved quicker and converges to the initial problem, for an increasing penalty factor. More recently, Cuturi [18] gives an efficient algorithm to compute discrete optimal transport problems with two marginals based on entropic penalization and Sinkhorn’s matrix scaling algorithm. Genevay et al. [27] and Solomon et al. [48] go further in this direction and give algorithms to compute arbitrary optimal transport problems with two marginals, where the algorithm (for the case of continuous marginals) is based on a reproducing kernel Hilbert space approach, and discretization, respectively. In [27] the authors already mention that more general regularizations beyond the entropic one are possible. Among others Benamou et al. [9] and Schmitzer [45] use scaling algorithms related to [18] for a larger class of problems, including for example (discrete) multi-marginal, constrained and unbalanced optimal transport. Carlier et al. [14] show -convergence of the entropic penalized Wasserstein-2 distance to the unpenalized one. The same kind of -convergence is also subject of the studies related to the Schrödinger problem [39], even for more general cost functions. Recent research by Arjovsky et al. [3, 30] inspired by generative adversarial networks include solving a particular optimal transport problem (the Wasserstein-1 distance) based on penalization. In these works, the numerical approach to solve optimal transport problems by parametrization of the dual variables by neural networks originated. Seguy et al. [46] apply a neural network based approach to arbitrary optimal transport problems with two marginals. Their theoretical results are broadly based on entropic penalization, discretization, and weakly continuous dependence of the optimal transport problem on the marginals.


The current paper gives a unifying numerical solution approach to problems of the form based on penalization and neural networks. The focus lies both on general applicability with respect to the choice of problem, and also on a flexible framework regarding the solution method.

Compared to the existing literature, which often focusses on a single representative (often the optimal transport problem) among problems of the form , our theoretical results are widely applicable. Similarly, the penalization method and the resulting dual relations in this paper allow for many different forms of reference measure and penalty function , while the existing literature is often restricted to uniform or product reference measures, and exponential penalty functions.111

In discrete settings, the reference measure is usually the uniform distribution (see e.g. 

[18], where the penalization does not explicitly include a reference measure. The penalization applied is simply the entropy of a measure, with corresponds to the relative entropy with uniform reference measure). In non-discrete settings, usually the product measure of the marginals specified by the optimal transport problems are used (see e.g. [27, 46]). We show the effects of different reference measures and different penalty functions both theoretically in Theorem 2.2 and practically in the numerical examples in Section 4. In some examples the choice of an appropriate reference measure is crucial, see e.g. Section 4.4. Equation (2.6) of Theorem 2.2 also motivates an updating procedure for the reference measure to reduce the error arising from penalization, which is applied in Section 4.5.

The presented approach is showcased with several examples, which are mostly toy problems taken from existing papers. The reason we use toy problems is to allow for an evaluation of the numerical methods that can be based on analytical solutions.

Structure of the paper

In Section 2 we present the theoretical results on approximation and regularization. Section 3 discusses the particular case of as built by multilayer feedforward networks. In Section 4 we illustrate the proposed method with several examples. All proofs are postponed to Section 5.

2 Regularization and approximation of hedging functionals

Let be the set of all Borel probability measures on a Polish space , and denote by the linear space of all continuous bounded functions . We consider the superhedging functional


for , where is a pricing measure and . Throughout this section we assume that is a linear space which contains the constants (i.e. the constant functions).

In order to derive a dual representation, we assume that is continuous from above, i.e.  for every sequence in such that . By the nonlinear Daniell-Stone theorem it has a representation


for all , and the nonempty set . In particular . The problems (2.2) and (2.1) are in duality and we refer to (2.2) as the primal and (2.1) as the dual formulation. For the details we refer to the Appendix A. There it is outlined how the duality extends to unbounded functions. However, for the sake of readability we focus on .

The following example illustrates the basic setting:

Example 2.1.

Let , and denote by the set of all with first marginal , second marginal , etc. In the following examples, under the assumption that it is straightforward to verify that the corresponding superhedging functional is continuous from above.

  • (Multi-marginal) optimal transport [36, 51]:

  • Martingale optimal transport [7, 26]:

    where denotes the space of all continuous functions of linear growth corresponding to , see Appendix A. By Strassen’s theorem [49] the set is nonempty if the marginals are in convex order.

  • Optimal transport with additional constraints:

    for some and . For related problems we refer to [6] and the references therein.

2.1 Regularization of the superhedging functional by penalization

Our goal is to regularize the superhedging functional by considering the convolution


where for a sampling measure , and is a penalty function which is parametrized by . We assume that is a differentiable nondecreasing convex function such that . Its convex conjugate

satisfies . Common examples are

  • the exponential penalty function with conjugate ,

  • the penalty function with conjugate where for some .

In case that the functional (2.3) is a so-called optimized certainty equivalent, see Ben-Tal and Teboulle [8]. In the following result we show the dual representation of the regularized superhedging functional and its convergence to .

Theorem 2.2.

Let . Suppose there exists such that and . Then




whenever is an -optimizer of (2.2) such that and .

If is a minimizer of (2.3) then defined by


is a maximizer of (2.4).

2.2 Approximation of the superhedging functional

In this subsection we consider a sequence of subsets of , and set . For each , we define the approximated superhedging functional by


For the approximation of by , we need the following density condition on .

Condition (D):

For every and holds

  • for every there exists such that ,

  • there exists such that and for some compact subset of .

In Section 3 we will discuss Condition (D) in the context of multilayer feedforward networks. The condition allows for the following approximation result.

Proposition 2.3.

Assume that is a linear space which contains the constants. Under Condition (D) one has

for all .

Given a sampling measure and a parametrized penalty function as in the previous subsection, we define the approximated version of the regularized superhedging functional by


for all . As a consequence of the two approximative steps for in Theorem 2.2 and for in Proposition 2.3 we get the following convergence result.

Proposition 2.4.

Suppose that satisfies Condition (D) and for every there exists an -optimizer of (2.4) such that and . Then, for every one has for .

The existence of such -optimizers as required in Theorem 2.2 and Proposition 2.4 is for example established in [12] in the context of multi-marginal optimal transport problems in with absolutely continuous marginals. In general, the existence of such -optimizers crucially depends on the choice of , see also Example 3.6 for a simple illustration.

3 Modelling finite dimensional subspaces with multilayer feedforward networks

This section explains the specific choice of approximative subspaces as built by neural networks. Generally, a feasible alternative to neural networks is to build these spaces via basis functions, like polynomials, which is for example pursued in [33] in the context of martingale optimal transport. In contrast to a basis approach, where functions are represented as a weighted sum over fixed basis functions, neural networks rely on the composition of layers of simple functions. This has shown to be an efficient way to approximate a large class of functions with relatively few parameters. Before going into the results, we give the required notation for neural networks.

3.1 Notation

The type of neural networks we consider are fully connected feed-forward neural networks. Those are mappings of the form

where are affine transformations and is a nonlinear activation function that is applied elementwise, i.e.  for .

Regarding dimensions, there is an input dimension and a hidden dimension . This means maps from to , map from to , and maps from to . Each affine transformation can trivially be represented as for a matrix

and a vector

. All these matrices and vectors together are the parameters of the network, which can be regarded as an element of for some .

We will require the sets which contain all feed-forward neural networks with fixed structure (i.e. fixed number of layers and fixed dimensions) but unspecified parameter values. We denote by the sets of possible parameters for a fixed network structure (where formally, depends on the structure of the network), and by a particular neural network with layers, input dimension , hidden dimension and parameters . We denote the set of all such networks for by .

In the remainder of this section, we work with a fixed number of layers and input dimension, but allow for growing hidden dimension. For different hidden dimensions , denote by the corresponding parameter sets. We define

We want this definition to be independent of the precise choices of the parameter sets, which is why we make the standing assumption that the sets are growing in . One way to make this explicit is:

Assumption 3.1.

For any and a sequence of parameter sets , where is regarded as a subset of for some , we will always assume that and for all .

The only reason why we do not just set is that in Proposition 3.7 we make the assumption of compact parameter sets. Further, we assume

Assumption 3.2.

The activation function

is continuous, nondecreasing and satisfies the limit properties and .

3.2 Modelling via neural networks

In the following we assume that is of the form

where and are continuous functions for all . This form of includes many different problems, for instance the ones considered in Example 2.1 (e.g. in (a) one has where denotes the projection on the -th marginal component).

We approximate by

and its subspaces

In this context the problems are given by

for all . The final formulation illustrates that the problem is now reduced to a finite dimensional problem of finding the optimal parameters in a neural network. Further, the overall objective depends smoothly on the parameters, and the parameters are unconstrained. In short, problem

fits into the framework of machine learning problems that can be numerically solved by standard stochastic gradient descent based methods.

Under the standing Assumptions 3.1 and 3.2, the following lemma establishes situations when Condition (D), which is required for Proposition 2.3, is satisfied in the neural network setting.

Lemma 3.3.

satisfies the first part of Condition .

If and for , where is the projection from to -th marginal component , then satisfies the second part of Condition . Further, the second part of Condition is trivially satisfied whenever is compact.

Notably, part (b) can be seen as a large, but still exemplary case. Intuitively, the second part of Condition (D) is satisfied whenever the space is rich enough.

Remark 3.4.

Later in the numerics we will usually work with a ReLU activation function, i.e. 

. While this does not satisfy the latter limit property of Assumption 3.2, this is easily amendable: Basically, throughout the whole theory the assumptions will only be used to guarantee existence of neural networks with certain properties. Given Assumption 3.2, we will only require two layers () to obtain the necessary results. In the numerics however, we use more layers. If more layers are given, one can also bundle several layers and regard them as one layer, with a different activation function. For example:

Whenever is a mapping of the form , an -layer network with activation function can represent any function that a two layer network with activation function can represent. For one can easily see that is feasible, which satisfies Assumption 3.2.

3.3 Convergence

In this section we study in what sense converges to for the approximation by neural networks.

First, we study the case of uniform convergence in and , i.e. conditions for the convergence for . This is subject of Remark 3.5 below, which is a summary of results established in Section 2 and Section 3.2. The two approximative steps leading to uniform convergence are for and for .

On the other hand, sometimes the convergence for is not satisfied even though practically one obtains a good approximation. One such case is given in Example 3.6. Even if uniform convergence does not hold, one can still often connect problems and . This is done by the approximative steps for and for , where the latter is subject of Proposition 3.7. Here, instead of the strong assumption required for , the convergence can be shown by assuming that all parameter sets of the neural networks are compact.

Remark 3.5.

Under the assumptions of Lemma 3.3, Proposition 2.3 implies for . Further, given the existence of -optimizers for every as required in Theorem 2.2, convergence for holds. Given both assumptions, Proposition 2.4 yields for . The convergence for is a trivial consequence.

Example 3.6.

Let , and . Let be the set of all measures in with first marginal and second marginal , so that

Obviously, . Note that so that .

Consider two possible reference measures, being the uniform distribution on , and . For it is obvious that the existence of -optimizers as required in Theorem 2.2 is given, since itself is the optimizer of . Hence for holds.

On the other hand, there does not exist with , and hence . However, by first approximating by , the functional becomes smoother: Roughly speaking, the marginal constraints are slightly relaxed. This becomes obvious when studying the dual formulations

While one easily finds sequences of functions in so that the values at go to minus infinity but the penalty term stays bounded, this is impossible with functions in given that the activation function is continuous and the parameter sets are compact. So there is hope to establish the convergence for , which will indeed be a consequence of the following result.

Proposition 3.7.

Fix . Given that all parameter sets for of the neural networks occurring in are compact and is strictly positive (i.e.  gives positive mass to every non-empty open set), it holds for .

4 Numerical Examples

This section aims at showcasing how various frequently studied problems that fall into the theoretical framework of the previous sections can be implemented simply and effectively with neural networks. The examples focus on toy problems that allow an objective evaluation of the numerical results and give the reader an idea about the strengths and weaknesses of the presented approach. We chose a very basic implementation using Tensorflow and the Adam optimizer.222All used code is available on As for the network architecture: In all the examples, is as described in Section 3, and always approximates . To approximate we use a five layer ( in the previous chapter) ReLU-network with hidden dimension . We did not perform a hyper parameter search to obtain this architecture, but rather oriented ourselves at papers with comparable settings (e.g. at [19, 30, 46]). Notably, increasing the complexity (number of layers or hidden dimension) further did not change the numerical results significantly in the cases tested, so we believe the structure chosen to be adequate for the problems considered.

Simply put, the implementation works as follows: We perform a normal stochastic gradient type optimization (outsourced to the Adam optimizer) for a certain number of iterations to find near optimal parameters of the network. At each iteration during this process, the expectations in the objective function are replaced by averages over a fixed number (called batch size) of random points from the respective distributions. To obtain the numerical approximation of , we finally average the sample objective values over the last roughly of iterations. This is referred to as the dual value. Alternatively, one can use formula (2.6) to obtain sample points from an approximate optimizer of the primal problem and numerically evaluate , which is referred to as the primal value (more details on how to work with such an approximative optimizer is given in Section 4.5). If not stated otherwise, all reported values are dual values.

The numerical procedure we use can likely be improved by fine-tuning parameters or by using more complex network architectures. For example batch normalization is applied in a related setting in

[13] which appears to significantly speed up the optimization.

4.1 Optimal transport and Fréchet-Hoeffding bounds

With this first problem, we study the effects of different penalty functions, penalty factor, batch size and number of iterations of the Adam optimizer.

Figure 2: Fréchet-Hoeffding bounds: . Comparison of penalty function and exponential penalty function . The values plotted are running averages over the last 1000 iterations. The dotted red line is the true value . The dotted blue lines are bounds from below for obtained by Equation (2.5) in Theorem 2.2 for the respective choices of .

Let (where denotes the uniform distribution) and , where is the -th marginal of . For some fixed , define the function by333The function here is not continuous. Since the optimal transport problem is continuous from below (see e.g. [37]), the representation (2.2) nevertheless holds for all bounded measurable functions .

The value corresponds to the maximum value of a -dimensional copula at point . By the Fréchet-Hoeffding bounds we have an analytical solution to this problem, which is

In Figure 2 we observe how depends on the number of iterations of the Adam optimizer and the batch size. We observe that while higher batch sizes lead to more stable convergence, the speed of convergence appears not strongly related to batch size. This suggests that increasing batch sizes might lead to both quick and finally stable performance.444See also [47] and references therein for related concepts on how to optimally tune the optimization procedure. In this paper however, we decided to stick with standard parameters of the Adam optimizer and fixed batch size. This is done to avoid another layer of complexity when evaluating the numerical results. Since penalization appears more stable, we will mostly use this penalization for the rest of the applications. Further, the figure illustrates that the numerical solutions appear to approximately obtain the lower bounds for as given by Equation (2.5) in Theorem 2.2. I.e. one approximately has where is an optimizer of .555Here, is chosen as the optimizer which is uniform on the cubes and .

4.2 Multi-marginal optimal transport

MC quantization dual primal Laplace Com.
p = q = 2
0.413 0.401
3.279 3.258
3.123 3.041
p = 1, q = 2
1.537 1.531
1.753 1.740
6.759 6.743
18.001 17.539
34.235 32.521
Table 1: Multi-marginal optimal transport: Numerical values for

arising from different numerical schemes. The numbers in brackets are empirical standard deviations over 100 runs. LP denotes linear programming, based on either sampling the marginals randomly (MC) or using the quantization approach from

[42, Algorithm 4.5] to approximate marginals (quantization). The neural network (NN) implementation is based on penalization with . For the reproducing kernel Hilbert space solution (RKHS) as described in [28, Algorithm 3] we use a Laplace kernel and the same penalization as for the NN-method. For the final two rows, we report numerical values for . DNC entries did not converge. The final column are analytical reference values given by the comonotone coupling for two marginals.

The aim of this example is to compare the approach of this paper with existing methods for a numerically challenging problem. Let , where denotes the number of marginals and denotes the dimension of each marginal. Let for be

-mixtures of normal distributions with randomly chosen parameters, and define

. For let

where we write with , . Note that for two marginals, one has , where is the Wasserstein- distance with norm on .

In Table 1 we compare optimal values to this problem arising from different algorithmic approaches. We compare linear programming methods based on discretization of the marginals, the neural network method presented in this paper, and a reproducing kernel Hilbert space (RKHS) approach as described in [28, Algorithm 3]. For the linear programming methods, we use a maximal number of variables of , which was around the boundary so that Gurobi [32]

was still able to solve the resulting linear program on our computer. Regarding the RKHS algorithm we have to mention that it is the only method that is not building on an established package like Tensorflow or Gurobi. Hence efficiency with respect to running time and tuning of hyperparameters might be far from optimal for this approach. Notably, switching from exponential penalization as used in

[28] to -penalization was already a slight improvement. For the precise specifications of each algorithm and the problem setting, we refer to the code on

Evaluating the results, we find that the neural network based method and the linear programming method with quantization appear to work best. Surprisingly, even for the case with 10 marginals (where the linear program can only use 4 points to approximate each marginal!), the quantization method achieved a similar value as the neural network method for . We believe the reason is that the function is very smooth which a quantization approach can exploit. Hence we slightly changed to in the final two test cases, which makes the function less regular. These are the only cases where the neural network solution and the quantization method strongly differ. In the final case, the quantization approach still has to approximate each marginal distribution with just 4 points, while the neural network method can use millions of points. From this standpoint, one can place higher trust in the neural network solution, even though we have no analytical reference value to compare it against.

Initially, we included a fourth method based on the approach in this paper, but with a polynomial basis instead of neural networks. This performed very badly however (at least when using a standard monomial basis), and hence we omitted the results in this table.

4.3 Martingale optimal transport

In martingale optimal transport, the optimal transport problem is extended by imposing a martingale constraint on top of marginal constraints. Dimensions are regarded as discrete time-steps and the measures in are distributions of discrete stochastic processes with fixed marginal distributions as well as the condition that the process is a martingale.

Here, we consider a simple example with , where an analytical solution is known. This example is taken from [2]. Let , and set