1 Introduction
While scientific problems of interest continue to grow in size and complexity, managing uncertainty is increasingly paramount. As a result, the development and use of theoretical and numerical methods to reason in the face of uncertainty, in a manner that can accommodate large datasets, have been the focus of sustained research efforts in statistics, machine learning, information theory and computer science. The ability to construct a mapping which transforms samples from one distribution into samples from another enables the solution of many problems in machine learning.
One such problem is Bayesian inference (Gelman et al., 2014; Bernardo & Smith, 2001; Sivia & Skilling, 2006), where a latent signal of interest is observed through noisy observations. Fully characterizing the posterior distribution is in general notoriously challenging, due to the need to calculate the normalization constant of the posterior density. Traditionally, point estimation procedures are used, which obviate the need for this calculation, despite their inability to quantify uncertainty. Generating samples from the posterior distribution enables approximation of any conditional expectation, but this is typically performed with Markov chain Monte Carlo (MCMC) methods
(Gilks, 2005; Andrieu et al., 2003; Hastings, 1970; Geman & Geman, 1984; Liu, 2008), despite the following drawbacks: (a) the convergence rates and mixing times of the Markov chain are generally unknown, leading to practical shortcomings such as "sample burn-in" periods; and (b) the samples generated are necessarily correlated, lowering effective sample sizes and propagating errors throughout estimates (Robert & Casella, 2004). If we let $\mu_0$ be the prior distribution and $\mu_1$ the posterior distribution in a Bayesian inference problem, then an algorithm which can transform independent samples from $\mu_0$ into independent samples from $\mu_1$, without knowledge of the normalization constant in the density of $\mu_1$, enables calculation of any conditional expectation with fast convergence.

As another example, generative modeling problems entail observing a large dataset with samples from an unknown (high-dimensional) distribution $\mu_1$ and attempting to learn a representation or model so that new independent samples from $\mu_1$
can be generated. Emerging approaches to generative modeling rely on the use of deep neural networks and include variational autoencoders
(Kingma & Welling, 2013), generative adversarial networks (Goodfellow et al., 2014) and their derivatives (Li et al., 2015), and autoregressive neural networks (Larochelle & Murray, 2011). These models have led to impressive results in a number of applications, but their tractability and theory are still not fully developed. If $\mu_1$ can be transformed into a known and well-structured distribution (e.g., a multivariate standard Gaussian), then the inverse of the transformation can be used to transform new independent samples from that distribution into new samples from $\mu_1$.

While these issues relate to the functional attractiveness of the ability to characterize and sample from nontrivial distributions, there is also the issue of computational efficiency. There continues to be an ongoing upward trend in the availability of distributed and hardware-accelerated computational resources. As such, it would be especially valuable to develop solutions to these problems that are not only satisfactory in a functional sense, but are also capable of taking advantage of the ever-increasing scalability of parallelized computational capability.
1.1 Main Contribution
The main contribution of this work is to extend our previous results on finding transport maps to provide a more general transport-based push-forward theorem for pushing independent samples from a distribution $\mu_0$ to independent samples from a distribution $\mu_1$. Moreover, we show how, given only independent samples from $\mu_0$, knowledge of $\mu_1$ up to a normalization constant, and under the traditionally mild assumption of the log-concavity of $\mu_1$, the construction can be carried out in a distributed and scalable manner, leveraging the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). We also leverage variational principles from non-equilibrium thermodynamics (Jordan et al., 1998) to represent a transport map as an aggregate composition of simpler maps, each of which minimizes a relative entropy along with a transport-cost-based regularization term. Each map can be constructed with a complementary, ADMM-based formulation, resulting in the smooth, sequential construction of a measure transport map with applicability in high-dimensional settings.
Expanding on previous work on the real-world applicability of these general-purpose algorithms, we showcase the implementation of a Bayesian LASSO-based analysis of the Boston Housing dataset (Harrison & Rubinfeld, 1978) and a high-dimensional example of using transport maps for generative modeling on the MNIST handwritten digits dataset (LeCun et al., 1998).
1.2 Previous Work
A methodology for finding transport maps based on ideas from optimal transport within the context of Bayesian inference was first proposed in (El Moselhy & Marzouk, 2012) and expanded upon in conjunction with more traditional MCMC-based sampling schemes in (Marzouk et al., 2016; Parno & Marzouk, 2014; Parno et al., 2016; Spantini et al., 2016).
Our previous work used ideas from optimal transport theory to generalize the posterior matching scheme, a mutual-information-maximizing scheme for feedback signaling of a message point in arbitrary dimension (Ma et al., 2014; Ma & Coleman, 2011; Tantiongloc et al., 2017). Building upon this, we considered a relative entropy minimization formulation, as compared to what was developed in (El Moselhy & Marzouk, 2012), and showed that for the class of log-concave distributions, this is a convex problem (Kim et al., 2013). We also previously described a distributed framework (Mesa et al., 2015) that we expand upon here.
In the more traditional optimal transportation literature, convex optimization has been used with varying success in specialized cases (Papadakis et al., 2014), as have gradient-based optimization methods (Rezende & Mohamed, 2015; Benamou et al., 2015a, 2015b). The use of stochastic optimization techniques in optimal transport is also of current interest (Genevay et al., 2016). In contrast, our work below presents a specific distributed framework where extensions to stochastic updating have been previously developed in a general case. Incorporating them into this framework remains to be explored.
Additionally, there is much recent interest in the efficient and robust calculation of Wasserstein barycenters (centers of mass) across partial empirical distributions calculated over batches of samples (Cuturi & Doucet, 2014; Claici et al., 2018). Wasserstein barycenters have also been applied to Bayesian inference (Srivastava et al., 2015). While related, our work focuses instead on calculating the full empirical distribution through various efficient parameterizations discussed below.
Building on much of this, there is growing interest in specific applications of these transport problems in various areas (Arjovsky et al., 2017; Tolstikhin et al., 2017). These derived transport problems are proving to be a fruitful alternative approach and are the subject of intense research. The framework presented below is general purpose and could benefit many of these derived transport problems.
Excellent introductions to and references for the field can be found in (Villani, 2008; Santambrogio, 2015).
The rest of this paper is organized as follows: in Section 2, we provide some necessary definitions and background information; in Section 3, we describe the distributed general pushforward framework and provide several details on its construction and use; in Section 4, we formulate a specialized version of the objective specifically tailored for sequential composition; in Section 5, we discuss applications and examples of our framework; and we provide concluding remarks in Section 6.
2 Preliminaries
In this section we make some preliminary definitions and provide background information for the rest of this paper.
2.1 Definitions and Assumptions
Assume the space for sampling is given by $\Omega$, a convex subset of $d$-dimensional Euclidean space. Define the space of all probability measures on $\Omega$ (endowed with the Borel sigma-algebra) as $\mathcal{P}(\Omega)$. If $\mu \in \mathcal{P}(\Omega)$ admits a density with respect to the Lebesgue measure, we denote it as $p$.

Assumption 1.
We assume that $\mu_0, \mu_1 \in \mathcal{P}(\Omega)$ admit densities $p_0, p_1$ with respect to the Lebesgue measure.
This work is fundamentally concerned with trying to find an appropriate pushforward between two probability measures, $\mu_0$ and $\mu_1$:
Definition 2.1 (Pushforward).
Given $\mu_0, \mu_1 \in \mathcal{P}(\Omega)$, we say that a map $S: \Omega \to \Omega$ pushes forward $\mu_0$ to $\mu_1$ (denoted as $S_\# \mu_0 = \mu_1$) if a random variable $X$ with distribution $\mu_0$ results in $S(X)$ having distribution $\mu_1$.

Of interest to us is the class of invertible and "smooth" pushforwards:
Definition 2.2 (Diffeomorphism).
A mapping $S$ is a diffeomorphism on $\Omega$ if it is invertible and both $S$ and $S^{-1}$ are differentiable. Let $\mathcal{D}(\Omega)$ be the space of all diffeomorphisms on $\Omega$.
A subclass of these are those that are "orientation preserving":
Definition 2.3 (Monotonic Diffeomorphism).
A mapping $S \in \mathcal{D}(\Omega)$ is orientation preserving, or monotonic, if its Jacobian is positive-definite: $\nabla S(x) \succ 0$ for all $x \in \Omega$. Let $\mathcal{D}_+(\Omega)$ be the set of all monotonic diffeomorphisms on $\Omega$.
The Jacobian can be thought of as how the map “warps” space to facilitate the desired mapping. Any monotonic diffeomorphism necessarily satisfies the following Jacobian equation:
Lemma 2.4 (Monotonic Jacobian Equation).
Let $\mu_0, \mu_1 \in \mathcal{P}(\Omega)$ and assume they have densities $p_0$ and $p_1$. Any map $S \in \mathcal{D}_+(\Omega)$ for which $S_\# \mu_0 = \mu_1$ satisfies the following Jacobian equation:

$$p_0(x) = p_1(S(x)) \det \nabla S(x) \quad \text{for all } x \in \Omega. \tag{1}$$
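The change-of-variables identity in Lemma 2.4 can be checked numerically. The following is a hypothetical 1-D sketch (not from the paper): $p_0$ a standard Gaussian, $p_1$ a shifted and scaled Gaussian, and a linear map assumed for simplicity, so that $p_0(x) = p_1(S(x))\,S'(x)$ can be verified pointwise.

```python
import math

# 1-D illustration of the Jacobian equation (1): for a monotonic map S pushing
# p0 forward to p1, p0(x) = p1(S(x)) * S'(x). Densities and map are assumptions
# chosen for the sketch, not quantities from the paper.
def gaussian_pdf(x, mean=0.0, std=1.0):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def S(x):
    # Linear map pushing N(0,1) forward to N(3, 2^2): S(x) = 2x + 3.
    return 2.0 * x + 3.0

def S_prime(x):
    return 2.0

for x in [-1.5, 0.0, 0.7, 2.3]:
    lhs = gaussian_pdf(x)                                     # p0(x)
    rhs = gaussian_pdf(S(x), mean=3.0, std=2.0) * S_prime(x)  # p1(S(x)) * S'(x)
    assert abs(lhs - rhs) < 1e-12
```

The same identity in dimension $d$ replaces $S'(x)$ with $\det \nabla S(x)$.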
We will now concern ourselves with two different notions of “distance” between probability measures.
Definition 2.5 (KL Divergence).
Let $\mu_0, \mu_1 \in \mathcal{P}(\Omega)$ and assume they have densities $p_0$ and $p_1$. The Kullback-Leibler (KL) divergence, or relative entropy, between $\mu_0$ and $\mu_1$ is given by

$$D(\mu_0 \,\|\, \mu_1) = \int_\Omega p_0(x) \log \frac{p_0(x)}{p_1(x)} \, dx.$$

The KL divergence is non-negative and is zero if and only if $p_0(x) = p_1(x)$ for (Lebesgue-almost) all $x \in \Omega$.
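Since the KL divergence is an expectation under $p_0$, it can be estimated from samples of $p_0$ alone. A minimal sketch, assuming two unit-variance Gaussians (for which the closed-form divergence is half the squared mean difference):

```python
import math, random

# Monte Carlo estimate of D(p0 || p1) using samples from p0.
# Assumed example densities: p0 = N(0,1), p1 = N(1,1); closed form KL = 0.5.
random.seed(0)

def log_pdf(x, mean):
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)

n = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
kl_mc = sum(log_pdf(x, 0.0) - log_pdf(x, 1.0) for x in samples) / n
assert abs(kl_mc - 0.5) < 0.02
```

This sample-based view of the relative entropy is what makes the formulations later in the paper amenable to Monte Carlo approximation.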
Definition 2.6 (Wasserstein Distance).
For $\mu_0, \mu_1 \in \mathcal{P}(\Omega)$ with densities $p_0$ and $p_1$, the Wasserstein distance of order two between $\mu_0$ and $\mu_1$ can be described as

$$W_2^2(\mu_0, \mu_1) = \inf_{\pi \in \Pi(\mu_0, \mu_1)} \int_{\Omega \times \Omega} \|x - y\|^2 \, d\pi(x, y), \tag{2}$$

where $\Pi(\mu_0, \mu_1)$ denotes the set of joint distributions (couplings) with marginals $\mu_0$ and $\mu_1$.
The following theorem will be useful throughout:
Theorem 2.7 ((Brenier, 1987; Villani, 2003)).
Under Assumption 1, $W_2^2(\mu_0, \mu_1)$ can be equivalently expressed as

$$W_2^2(\mu_0, \mu_1) = \inf_{S \,:\, S_\# \mu_0 = \mu_1} \int_\Omega \|x - S(x)\|^2 \, p_0(x) \, dx, \tag{3}$$

and there is a unique minimizer $S^*$, which satisfies $S^* \in \mathcal{D}_+(\Omega)$.
Note that this implies the following corollary:
Corollary 2.8.
For any $\mu_0, \mu_1$ satisfying Assumption 1, there exists an $S \in \mathcal{D}_+(\Omega)$ for which $S_\# \mu_0 = \mu_1$, or equivalently, for which (1) holds.
3 KL Divergence-based Push-Forward
In this section, we present the distributed push-forward framework that relies on our previously published relative-entropy-based formulation of the measure transport problem, and discuss several issues related to its construction.
3.1 General Push-Forward
According to Lemma 2.4, a monotonic diffeomorphism pushing $\mu_0$ to $\mu_1$ will necessarily satisfy the Jacobian equation (1). Note that although we think of a map $S$ as pushing from $\mu_0$ to $\mu_1$, we have written (1) so that $p_0$ appears by itself on the left-hand side, while $p_1$ is being acted on by $S$ on the right-hand side. This notation is suggestive of the following interpretation: if we think of the destination density $p_1$ as an anchor point, then for any arbitrary mapping $S$, we can describe an induced density $\tilde{p}_S$ according to Eq. 1 as:
$$\tilde{p}_S(x) \equiv p_1(S(x)) \det \nabla S(x). \tag{4}$$
With this notation, we can interpret $\{\tilde{p}_S\}$ as a parametric family of densities, and for any fixed $S \in \mathcal{D}_+(\Omega)$, $\tilde{p}_S$ is a density which integrates to one. We note that by construction, any $S$ necessarily pushes the measure induced by $\tilde{p}_S$ to $\mu_1$. We can then cast the transport problem as finding the mapping $S$ that minimizes the relative entropy between $p_0$ and the induced $\tilde{p}_S$:
$$\min_{S \in \mathcal{D}_+(\Omega)} D(p_0 \,\|\, \tilde{p}_S). \tag{5}$$
This perspective is represented visually in Fig. 1.
We again make another natural assumption:

Assumption 2.
$\mu_1$ admits a density $p_1$ such that:
We can expand Eq. 5 and combine it with (4) to write:
(6)  
(7)  
(8) 
where in (6), $h(p_0)$ is the Shannon differential entropy of $p_0$, which is fixed with respect to $S$; (7) follows from Assumption 2, Jensen's inequality, and the non-negativity of the KL divergence; and (8) follows by combining with (4).
We now make another assumption under which we can guarantee efficient methods to solve (5).
Assumption 3.
The density $p_1$ is log-concave.
Theorem 3.1.
Under Assumptions 1-3, the objective of (5) is a convex function of $S$ on $\mathcal{D}_+(\Omega)$, and there exists an $S \in \mathcal{D}_+(\Omega)$ achieving its minimum value of zero.

Proof.
For any $S \in \mathcal{D}_+(\Omega)$ and any $x \in \Omega$, we have that $\nabla S(x) \succ 0$. Since $\log\det(\cdot)$ is strictly concave over the space of positive definite matrices (Boyd & Vandenberghe, 2004), and by assumption $\log p_1$ is concave, we have that $-\log p_1(S(x)) - \log\det \nabla S(x)$ is a convex function of $S$ on $\mathcal{D}_+(\Omega)$. Existence of an $S$ for which the objective attains zero is given by Corollary 2.8. ∎
An important remark on this theorem:
Remark 1.
Theorem 3.1 does not place any structural assumptions on $\mu_0$. It need not be log-concave, for example.
Beginning with Eq. 8 above, we see that Problem (GP) can then be solved through a Monte Carlo approximation of the expectation, and we arrive at the following sample-based version of the formulation:

$$\min_{S \in \mathcal{D}_+(\Omega)} \frac{1}{N} \sum_{i=1}^{N} \left[ -\log p_1(S(x^{(i)})) - \log\det \nabla S(x^{(i)}) \right], \tag{9}$$

where $x^{(1)}, \dots, x^{(N)} \stackrel{\text{i.i.d.}}{\sim} \mu_0$.
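The structure of the sample-based objective (9) can be made concrete with a toy example. The following is a hypothetical 1-D sketch (assumed setup, not from the paper): source $N(0,1)$, target $N(3, 2^2)$, and a linear map $S(x) = ax + b$, for which the Monte Carlo objective is minimized near $a = 2$, $b = 3$; a crude grid search stands in for the convex solver.

```python
import math, random

# Sample-based objective (9): average of -log p1(S(x_i)) - log det(grad S(x_i))
# over samples x_i from the source. Setup (densities, linear map, grid search)
# is an illustrative assumption.
random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]

def objective(a, b):
    # -log p1(a*x + b) up to constants is 0.5*((a*x + b - 3)/2)^2; -log det is -log a.
    avg = sum(0.5 * ((a * x + b - 3.0) / 2.0) ** 2 for x in xs) / len(xs)
    return avg - math.log(a)

best = min(((objective(a / 10.0, b / 10.0), a / 10.0, b / 10.0)
            for a in range(5, 41) for b in range(0, 61)),
           key=lambda t: t[0])
_, a_star, b_star = best
assert abs(a_star - 2.0) < 0.2 and abs(b_star - 3.0) < 0.2
```

In the paper's setting the map is a high-dimensional polynomial expansion and the minimization is carried out by the distributed ADMM scheme below, but the objective being minimized has exactly this shape.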
3.2 Consensus Formulation
The stochastic optimization problem in (9) takes the general form of:

$$\min_{S} \sum_{i=1}^{N} f_i(S).$$

From this perspective, $S$ can be thought of as a complicating variable. That is, this optimization problem would be entirely separable across the sum were it not for $S$. This can be instantiated as a global consensus problem:

$$\min_{\{S_i\},\, S} \sum_{i=1}^{N} f_i(S_i) \quad \text{s.t. } S_i = S,\ i = 1, \dots, N,$$

where the optimization is now separable across the summation, but we must achieve global consensus over $S$. With this in mind, we can now write a global consensus version of (9) as:
$$\min_{\{S_i\},\, S} \frac{1}{N} \sum_{i=1}^{N} \left[ -\log p_1(S_i(x^{(i)})) - \log\det \nabla S_i(x^{(i)}) \right] \quad \text{s.t. } S_i = S,\ i = 1, \dots, N. \tag{10}$$
In this problem, we can think of each (batch of) sample $x^{(i)}$ as independently inducing some random map $S_i$. The method proposed below can then be thought of as iteratively reducing the distance between each $S_i$ and the true map by driving the $S_i$ toward consensus.
This problem is still posed over an infinite-dimensional space of functions, however.
3.3 Transport Map Parameterization
To address the infinite-dimensional space of functions mentioned above, as in (Mesa et al., 2015; Kim et al., 2013, 2015; Marzouk et al., 2016) we parameterize the transport map over a space of multivariate polynomial basis functions, each formed as the product of univariate polynomials of varying degree. That is, given a multi-index $j = (j_1, \dots, j_d)$, we form a basis function of multi-index degree $j$ using univariate polynomials of degrees $j_1, \dots, j_d$ as:
This allows us to represent one component of $S$ as a weighted linear combination of basis functions with weights $w$ as:
where $\mathcal{J}$ is the set of multi-indices in the representation specifying the orders of the polynomials in the associated expansion, and $S^k$ denotes the $k$-th component of the mapping. In order to make this problem finite-dimensional, we must truncate the expansion to some fixed maximum order.
We can now approximate any nonlinear function as:
where $|\mathcal{J}|$ denotes the size of the index set, and $W$ is a matrix of weights.
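A minimal sketch of this parameterization, under assumed choices (probabilists' Hermite polynomials as the univariate basis and a total-order truncation rule; the paper leaves the basis and truncation open):

```python
import itertools

# Multi-index polynomial parameterization: each basis function is a product of
# univariate Hermite polynomials, one per input dimension; a map component is a
# weighted sum of basis functions. Basis choice and weights are illustrative.
def hermite(k, x):
    # Probabilists' Hermite polynomials via He_{k+1} = x*He_k - k*He_{k-1}.
    if k == 0:
        return 1.0
    h_prev, h = 1.0, x
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

def total_order_multi_indices(dim, max_order):
    return [idx for idx in itertools.product(range(max_order + 1), repeat=dim)
            if sum(idx) <= max_order]

def basis_eval(multi_index, x):
    out = 1.0
    for k, xi in zip(multi_index, x):
        out *= hermite(k, xi)
    return out

indices = total_order_multi_indices(dim=2, max_order=2)   # (0,0), (0,1), ..., (2,0)
weights = {idx: 0.0 for idx in indices}
weights[(1, 0)] = 1.0   # identity in the first coordinate, as a simple example

def map_component(x):
    return sum(w * basis_eval(idx, x) for idx, w in weights.items())

assert len(indices) == 6
assert abs(map_component([0.7, -0.3]) - 0.7) < 1e-12
```

The exponential growth of $|\mathcal{J}|$ with dimension and order, discussed in Section 3.5, is visible here: a total-order-$p$ set in dimension $d$ has $\binom{d+p}{p}$ elements.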
In order to avoid confusion, and in the spirit of consensus ADMM as shown in Boyd et al. (2011), we introduce a consensus variable $Z$. With this, we can now give a finite-dimensional version of (10) as:
(11)  
with:
where we have made explicit the implicit constraint that the parameterized map be monotonic. We now provide two important remarks:
Remark 2.
In principle, any basis of polynomials whose finite-dimensional approximations are sufficiently dense over $\Omega$ will suffice. In applications where $\mu_0$ is assumed known, the basis functions are chosen to be orthogonal with respect to the reference measure $\mu_0$:
Within the context of Bayesian inference, for instance, this greatly simplifies computing conditional expectations, corresponding conditional moments, etc.
(Schoutens, 2000).

Remark 3.
When it is important to ensure that the approximation satisfies the properties of a diffeomorphism, we can project onto $\mathcal{D}_+(\Omega)$ by solving a quadratic optimization problem, as discussed in Ensuring Diffeomorphism Properties of Parameterized Maps.
We also note that the polynomial representation presented above is chosen to best approximate a transport map, independent of a specific application or representation of the data (Fourier, wavelet, etc.). As mentioned in Remark 2 above, in principle any dense basis will suffice.
3.4 Distributed Push-Forward with Consensus ADMM
In this section we will reformulate (11) within the framework of the alternating direction method of multipliers (ADMM), and provide our main result, Corollary 3.2.
3.1 Distributed Algorithm
Using ADMM, we can reformulate (11) as a global consensus problem to accommodate a parallelizable implementation. For notational clarity, we write and . We then introduce the following auxiliary variables:
We can now write (10) as:
s.t.  
where in the feasible set, we have denoted the Lagrange multiplier that will be associated with each constraint to the right.
Although coordinate descent algorithms, which solve for one variable at a time while fixing the others, can be extremely efficient, they are not always guaranteed to find the globally optimal solution (Wright, 2015). Using the consensus formulation of ADMM above, we consider a problem formulation with the same global optimum which adds quadratic penalties associated with the equality constraints to the objective function, with the constraints still imposed. The consensus formulation has the key property that its Lagrangian, termed the "augmented Lagrangian" (Boyd et al., 2011), can be globally minimized with coordinate descent algorithms for any penalty parameter $\rho > 0$. Note that when $\rho = 0$, the augmented Lagrangian is equivalent to the standard (unaugmented) Lagrangian associated with (11).
We can now raise the constraints to form the fully-penalized augmented Lagrangian as:
The key property we leverage from the ADMM framework is the ability to minimize this Lagrangian across each optimization variable sequentially, using only the most recently updated estimates. After simplification (details can be found in the Appendix), the final ADMM update equations for the remaining variables are:
(12a)  
(12b)  
(12c)  
(12d)  
(12e)  
(12f)  
(12g) 
We look first at the consensus variable $Z$. We can separate its update into two pieces: a static component and an iterative component:
(13a)  
(13b) 
The consensus variable can then be thought of as averaging the effect of all other auxiliary variables, and forming the current best estimate for consensus among the distributed computational nodes.
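The averaging role of the consensus variable can be illustrated on a toy problem. The following is a minimal sketch of the consensus-ADMM pattern only (the local objectives here are simple quadratics, not the transport objective): each "node" holds one datum, local updates are closed-form and parallelizable, and the consensus update averages local estimates plus duals.

```python
# Consensus ADMM on: minimize sum_i 0.5*(x_i - a_i)^2  s.t. x_i = z.
# Data, rho, and iteration count are illustrative assumptions.
rho = 1.0
a = [1.0, 2.0, 6.0]                 # local data held by each node
x = [0.0] * len(a)                  # local variables
u = [0.0] * len(a)                  # scaled dual variables
z = 0.0                             # consensus variable

for _ in range(100):
    # Local updates (parallelizable): argmin 0.5*(x - a_i)^2 + rho/2*(x - z + u_i)^2
    x = [(a_i + rho * (z - u_i)) / (1.0 + rho) for a_i, u_i in zip(a, u)]
    # Consensus update: average of local estimates plus duals.
    z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / len(x)
    # Dual updates.
    u = [u_i + x_i - z for x_i, u_i in zip(x, u)]

assert abs(z - 3.0) < 1e-6          # consensus at the average of the local data
```

The transport formulation (12) follows this same template, with the closed-form local updates replaced by the update equations (12a)-(12f) and the non-closed-form update (12g).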
The final update (12g) is the only remaining minimization step that cannot necessarily be solved in closed form, as it completely contains the structure of the density $p_1$. In its penalization, all other optimization variables are fixed:
The formulation of (12) has the following desirable property: Eq. 12g is a penalized, finite-dimensional convex optimization problem that entirely captures the structure of $p_1$. In particular, any changes to the problem specifying a different structure for $p_1$ will be entirely confined to this update; furthermore, algorithm designers can utilize any optimization procedure or library of their choosing to perform this update.
With this, we can now give an efficient, distributed version of the general push-forward theorem:
Corollary 3.2 (Distributed Push-Forward).
Remark 4.
ADMM's convergence properties are robust to inaccuracies in the initial stages of the iterative solving process (Boyd et al., 2011). Additionally, several key concentration results provide very strong bounds for averages of random samples from log-concave distributions, showing that the approximation is indeed robust (Bobkov et al., 2011, Thms. 1.1, 1.2).
The above framework, under natural assumptions, facilitates the efficient, distributed and scalable calculation of an optimal map that pushes forward some $\mu_0$ to some $\mu_1$.
3.5 Structure of the Transport Map
An important consideration in ensuring that the construction of transport maps is efficient is their underlying structure. In Section 3.3 we described a parameterization of the transport map through the multi-index set $\mathcal{J}$, the indices of polynomial orders involved in the expansion. However, this parameterization tends to be infeasible to use in high dimension or with high-order polynomials, due to the exponential rate at which the number of polynomials increases with respect to these two properties.
In (Marzouk et al., 2016), two less expressive but more computationally feasible map structures that can be used to generate the transport map were discussed, which we briefly reproduce here, along with some useful properties. For more specific details and examples of multi-index sets pertaining to each mode for implementation purposes, see Transport Map Multi-Indices Details.
The first alternative to the fully-expressive dense mapping is the Knothe-Rosenblatt map (Bonnotte, 2013), which our group also previously used within the context of generating transport maps for optimal message-point feedback communication (Ma & Coleman, 2011). Here, each component of the output, $S^k$, is only a function of the first $k$ components of the input, resulting in a mapping that is lower-triangular. Both the Knothe-Rosenblatt and dense mappings described above perform the transport from one density to another, but with different geometric transformations. An example of these differences can be found in Figures 3 and 4 of (Ma & Coleman, 2011).
A Knothe-Rosenblatt arrangement gives the following multi-index set (note that the index set is now subscripted according to the dimension of the data, denoting the dependence on the data component):
An especially useful property of this parameterization is the following identity for the Jacobian of the map:

$$\det \nabla S(x) = \prod_{k=1}^{d} \partial_k S^k(x), \tag{15}$$

where $\partial_k S^k(x)$ represents the partial derivative of the $k$-th component of the mapping with respect to the $k$-th component of the data, evaluated at $x$.
Furthermore, the positive-definiteness of the Jacobian can equivalently be enforced for a lower-triangular mapping by ensuring the following:

$$\partial_k S^k(x) > 0 \quad \text{for } k = 1, \dots, d \text{ and all } x \in \Omega. \tag{16}$$
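The triangular structure behind (15) and (16) can be checked numerically. A hypothetical 2-D triangular map is assumed below (not from the paper): the Jacobian is lower-triangular, so its determinant is the product of the diagonal partials, and monotonicity reduces to those partials being positive.

```python
# Lower-triangular (Knothe-Rosenblatt) structure: S^1 depends only on x1,
# S^2 on (x1, x2), so det(grad S) = (dS^1/dx1) * (dS^2/dx2). Example map is
# an illustrative assumption.
def S(x):
    x1, x2 = x
    return [2.0 * x1 + 1.0,            # S^1 depends only on x1
            x1 ** 2 + 3.0 * x2]        # S^2 depends on x1 and x2

def diag_partial(k, x, h=1e-6):
    xp = list(x)
    xp[k] += h
    return (S(xp)[k] - S(x)[k]) / h    # finite-difference dS^k/dx_k

x = [0.4, -1.2]
d1, d2 = diag_partial(0, x), diag_partial(1, x)
jac_det = d1 * d2                      # product of diagonals, as in (15)
assert d1 > 0 and d2 > 0               # monotonicity condition (16)
assert abs(jac_det - 6.0) < 1e-3       # 2 * 3 for this particular map
```

This is the property exploited in Section 4.3, where the determinant update collapses to scalar computations per dimension.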
We can then write a Knothe-Rosenblatt special-case version of Eq. 10 as:
(17) 
Indeed, we use this to our advantage in Section 4.
Finally, in the event that the Knothe-Rosenblatt mapping also proves to have too high a model complexity, an even less expressive option is a Knothe-Rosenblatt mapping that ignores all multivariate polynomials involving more than one data component of the input at a time, resulting in the following multi-index set:
Although less expressive and less precise than the total-order Knothe-Rosenblatt map, these maps can often still perform at an acceptable level of accuracy on many problems.
3.6 Algorithm for Inverse Mapping with Knothe-Rosenblatt Transport
It may be desirable to compute the inverse mapping of a given sample $y$ from $\mu_1$, that is, $S^{-1}(y)$. When the forward mapping is constrained to have Knothe-Rosenblatt structure, and a polynomial basis is used to parameterize the mapping, the process of inverting a sample from $\mu_1$ reduces to solving a sequential series of polynomial root-finding problems (Marzouk et al., 2016). We give a more detailed implementation-based explanation of this process, alongside a discussion of implementation details for the Knothe-Rosenblatt maps, in Inverse Map Details.
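The sequential nature of the inversion can be sketched as follows, with an assumed 2-D triangular map and generic bisection standing in for the polynomial root-finding described in the text: solve the first component for $x_1$ alone, substitute it into the second component to solve for $x_2$, and so on.

```python
# Inverting a triangular map y = S(x) by sequential 1-D root finding.
# Example map and search bracket are illustrative assumptions; each component
# is monotone in its own coordinate, as required by (16).
def S(x):
    x1, x2 = x
    return [x1 ** 3 + x1,              # monotone in x1
            x1 + 2.0 * x2]             # monotone in x2 given x1

def bisect(f, lo, hi, tol=1e-10):
    # Assumes f is increasing with f(lo) < 0 < f(hi).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def invert(y, bound=10.0):
    x1 = bisect(lambda t: S([t, 0.0])[0] - y[0], -bound, bound)   # first coordinate alone
    x2 = bisect(lambda t: S([x1, t])[1] - y[1], -bound, bound)    # substitute x1, solve x2
    return [x1, x2]

y = S([0.5, -1.0])
x = invert(y)
assert abs(x[0] - 0.5) < 1e-6 and abs(x[1] + 1.0) < 1e-6
```

With a polynomial parameterization, each 1-D solve can instead use a dedicated polynomial root finder, as the text notes.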
4 Sequential Composition of Optimal Transportation Maps
In this section, we introduce a scheme for using many individually computed maps in sequential composition to achieve the overall effect of a single large mapping from $\mu_0$ to $\mu_1$. By using a sequence of maps instead of a single one-shot map, one can rely on models of lower complexity to represent each map in the sequence: although each map is, on its own, "weak" in its ability to induce large changes in distribution space, the combined action of many such maps can successfully transform samples as desired. This is especially attractive for model structures whose complexity increases exponentially with problem size, such as the dense polynomial chaos structure discussed in the previous section. This sequential composition process is visually represented in Figure 2.
Moving forward, we first take a brief look at a non-equilibrium thermodynamics interpretation of this methodology to further justify the use of such a scheme, and then derive a slightly different ADMM problem to implement it.
4.1 Non-Equilibrium Thermodynamics and Sequential Evolution of Distributions
One approach to interpreting sequential composition of maps is to borrow ideas from statistical physics, where we can interpret $p_1$ as the equilibrium density of particles in a system which at time $t$ is out of equilibrium with density $p_t$. Since $p_1$ is an equilibrium density, it can be written as a Gibbs distribution (with temperature equal to 1 for simplicity): $p_1(x) \propto e^{-\Psi(x)}$. For instance, if $p_1$ pertains to a standard Gaussian, then $\Psi(x) = \|x\|^2/2$. Assuming the particles obey the Langevin equation, it is well known that the evolution of the particle density as a function of time obeys the Fokker-Planck equation. It was shown in (Jordan et al., 1998) that the trajectory of $p_t$ can be interpreted from variational principles. Specifically,
Theorem 4.1 ((Jordan et al., 1998), Thm. 5.1).
Define and and assume that . For any , consider the following minimization problem:
(18)  
(19) 
Then as $\tau \to 0$, the piecewise-constant interpolation which equals the $k$-th minimizer on the interval $[k\tau, (k+1)\tau)$ converges weakly to $p_t$, the solution to the Fokker-Planck equation.

The log-concave structure of $p_1$ we have exploited previously also has implications for exponential convergence to equilibrium from this statistical physics perspective:
Theorem 4.2 ((Bakry & Émery, 1985)).
Note that if $p_1$ is the density of a standard Gaussian, this inequality holds with $\lambda = 1$.
4.2 Sequential Construction of Transport Maps
We now note that for any $\tau > 0$, (19) encodes a sequence of densities which evolve towards $p_1$. For notational conciseness in this section, we will use a subscript on $S$ to denote the position of the map in a sequence of maps. As such, from Corollary 2.8, there exists a map in $\mathcal{D}_+(\Omega)$ pushing each density in the sequence forward to the next and, more generally, pushing any density in the sequence forward to any later one.
Lemma 4.3.
Define as
(20) 
Then and .
Proof.
From the definition of in (4) and the invariance of relative entropy under an invertible transformation, any satisfies
As such, moving forward with the proof, we will exploit how where
From Theorem 2.7, for any . Also, since the relative entropy terms of and are equal, it follows that for any . Moreover, from Corollary 2.8, we have that there exists an for which and
Thus . ∎
As such, a natural composition of maps underlies how a sample from $\mu_0$ gives rise to a sample from $\mu_1$:
(21) 
Moreover, since consecutive densities in the sequence approach one another as $\tau \to 0$, each map $S_k$ approaches the identity map. Thus for small $\tau$, each $S_k$ should be estimable with reasonable accuracy using lower-order maps. That is, $S$ can be described as the composition of $K$ maps as
(22) 
such that each $S_k$ is of relatively low order in the polynomial chaos expansion.
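The idea behind the composition (22) can be demonstrated with a deliberately simple assumed setup (1-D, linear stage maps, Gaussian source and target, all illustrative): each stage applies only a small change, yet the composition of $K$ stages achieves the full transport.

```python
import random

# Composing K "weak" maps T_k(x) = a*x + b to realize one large transport.
# A single-stage map pushing N(0,1) to N(3, 2^2) would need a = 2, b = 3;
# here each stage scales by only 2^(1/K).
random.seed(2)
K = 10
a = 2.0 ** (1.0 / K)
# Composed map is a^K * x + b * (a^(K-1) + ... + 1); choose b so the total shift is 3.
b = 3.0 / sum(a ** j for j in range(K))

xs = [random.gauss(0.0, 1.0) for _ in range(50000)]
for _ in range(K):
    xs = [a * x + b for x in xs]       # one weak map per stage

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
assert abs(mean - 3.0) < 0.05          # composed samples match the target mean
assert abs(var - 4.0) < 0.15           # and the target variance
```

In the paper's scheme each stage is a low-order polynomial map fit by SOT rather than a fixed linear map, but the benefit is the same: per-stage model complexity stays small while the composition remains expressive.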
Note that the objective as written above involves an expectation with respect to the current density in the sequence. Since our scheme operates sequentially, we have already estimated the preceding maps and can generate approximate i.i.d. samples from this density by first generating i.i.d. samples from $\mu_0$ and constructing

Below, we demonstrate efficient ways to solve the convex optimization problem which replaces this expectation with the empirical expectation over these samples.
To reiterate, we consider a distribution formed by the sequential composition of previous mappings as
We then try to find a map that pushes this distribution forward closer to $\mu_1$. Each map is obtained by solving the optimization problem (20), which we term SOT. As the number of compositions in (22) increases, the composed distribution approaches $\mu_1$. When $p_1$ is uniformly log-concave, this greedy, sequential approach still guarantees exponential convergence.
In the context of Knothe-Rosenblatt maps, for every map in the sequence we can solve the following optimization problem (in the following equation, we drop the subscript that indicates the sequential map index, as the formulation does not depend on position in the map sequence, and we once again use subscripts to indicate the distributed variables of the consensus problem):
(23)  
where the penalty weight $1/\tau$ can be interpreted as an inverse "step-size" parameter.
Though each map in the sequence must be calculated after the previous one, each individual mapping can still be calculated in the distributed framework described above. This implies that at each round, one could adaptively choose the parameters for the next round's solve.
4.3 ADMM Formulation for Learning Sequential Maps
We now showcase an ADMM formulation for the optimal-transportation-based objective function, similar in spirit to that of Eq. 12.
We first introduce the following conventions: $\partial_i S^k$ represents the partial derivative of $S^k$ taken with respect to the $i$-th component of the input, and $e_k$ represents a one-hot vector of length $d$ with the one in the $k$-th position.
We can then introduce a finite-dimensional representation of the transport map, as well as auxiliary variables and a consensus variable, to Eq. 23 and rewrite the problem as:
(24) 
where we have once again denoted the corresponding Lagrange multipliers to the right of each constraint. The superscript notation represents the fact that in this formulation, in addition to having separable variables for each data sample, some variables are now unique to an index over dimension as well; consequently, there are many more variables that must be solved for. We can now raise the constraints to form the fully-penalized Lagrangian as:
(25) 
The final ADMM update equations for each variable are once again all closed-form, with the exception of one optimization step. For the sake of brevity, we refer the reader to the section Derivation of Knothe-Rosenblatt ADMM Formulation and Final Updates in the Appendix for the exact update equations.
However, one notable difference from the formulation of Section 3.1 is that this update has been simplified from requiring an eigenvalue decomposition to requiring a simple scalar computation, significantly reducing computation time, especially in higher dimensions.
4.4 Scaling Parallelization with GPU Hardware
Given the parallelized formulations above, we implemented our algorithm using the Nvidia CUDA API to extract as much performance as possible from our formulation and to maximize the problem sizes we could reasonably handle while keeping computation time short. To test the algorithm's parallelizability, we ran our implementation on a single Nvidia GTX 1080 Ti GPU, as well as on a single p3.16xlarge instance on Amazon Web Services, which contains 8 onboard Tesla V100 GPUs.
For this test, we sampled synthetic data from a bimodal distribution specified as a combination of two Gaussian distributions, for a wide range of problem dimensions and a constant number of samples. We then find a transport map pushing $\mu_0$ to $\mu_1$, composed of a sequence of 10 individual Knothe-Rosenblatt maps with no mixed multivariate terms, and monitor the convergence of the dual variables for proper termination of the algorithm.

Figure 3 shows the result of this analysis. The 1 GPU curve corresponds to performance using the single GTX 1080 Ti, and the AWS curve corresponds to performance using the 8-GPU system on Amazon Web Services. The trending of the curves shows that, as expected, as problem dimension increases, a multi-GPU system continues to maintain reasonable computation times relative to a single-GPU system, whereas fewer GPUs accumulate increasingly high computational costs. In addition, the parallelizability of our algorithm has the subtle benefit of easing memory-usage issues: since we can distribute samples across multiple devices, we can subsequently distribute all corresponding ADMM variables as well. Indeed, the single GTX 1080 Ti ran out of onboard memory at a much smaller problem dimension than the 8-GPU system could handle.
5 Applications
The framework presented above is general-purpose, and works to push forward a distribution $\mu_0$ to a log-concave distribution $\mu_1$. Below we discuss some interesting applications, namely Bayesian inference and generative modeling, and show results on real-world datasets.
5.1 Bayesian Inference
A very important instantiation of this framework arises when we consider $\mu_0$ to represent a prior distribution and $\mu_1$ to be a Bayesian posterior:
where the normalization constant does not vary with the latent variable and is given by:
Using Eq. 1 and combining with Bayes' rule above, we can write:
where the notation indicates that the optimal map is found with respect to the observations. We note that log-concavity of the posterior density is equivalent to log-concavity of the prior density and log-concavity of the likelihood in the latent variable: the same criterion under which MAP estimation is a convex problem. Thus Corollary 3.2 extends to the special case of Bayesian inference; i.e., we can generate i.i.d. samples from the posterior distribution by solving a convex optimization problem in a distributed fashion.
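Two points in the paragraph above can be illustrated with a minimal assumed example (scalar latent variable, Gaussian prior and likelihood, hypothetical observations): the framework only ever needs the posterior up to its normalization constant, i.e. log prior plus log likelihood, and the sum of two concave log-densities is itself concave.

```python
# Unnormalized log posterior: log prior + log likelihood, no normalization
# constant required. Prior N(0,1) and Gaussian likelihood are assumptions for
# the sketch; observations y_obs are hypothetical.
y_obs = [1.2, 0.8, 1.5]

def unnormalized_log_posterior(x):
    log_prior = -0.5 * x ** 2
    log_likelihood = sum(-0.5 * (y - x) ** 2 for y in y_obs)
    return log_prior + log_likelihood

# Log-concavity check along a grid: second differences should be non-positive.
h = 0.01
for g in range(-20, 21):
    x = g / 10.0
    second_diff = (unnormalized_log_posterior(x + h)
                   - 2 * unnormalized_log_posterior(x)
                   + unnormalized_log_posterior(x - h))
    assert second_diff <= 0.0
```

In the transport framework, this unnormalized log posterior is exactly what enters the density-dependent update (Eq. 12g), which is why the normalization constant never needs to be computed.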
Due to the unique way the ADMM steps were structured, this special case only requires specifying a particular instance of Eq. 12g: