Image morphing of two input images is a visual effect in which a sequence of images is obtained, transforming one image into the other. Denoting the input images by $I_0$ and $I_1$, the objective is to find a sequence of images $\{I_\alpha\}_{\alpha \in [0,1]}$ that transforms $I_0$ into $I_1$. Generally, there are infinitely many ways of transforming one image into the other. Nevertheless, a pleasant transition should uphold the following properties. First, the difference between any two consecutive frames should be roughly constant, leading to a smooth, steady-paced animation. Second, the overall variation across the entire transition should be minimal, avoiding unnecessary changes.
The naive solution to consider for image morphing is a simple linear interpolation between the two images, i.e. $I_\alpha = (1-\alpha)\, I_0 + \alpha\, I_1$ with $\alpha \in [0,1]$. While this method indeed produces a smooth transition, it leads to unnatural intermediate samples that contain unpleasant double-exposure artifacts. Therefore, to obtain a pleasant transition, an additional requirement is needed.
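As a concrete sketch, the cross-dissolve baseline can be written in a few lines of NumPy; the toy "images" below are illustrative, not data from the paper:

```python
import numpy as np

def cross_dissolve(img0, img1, num_frames=8):
    """Naive morphing baseline: per-pixel linear interpolation.

    Produces a smooth transition, but each intermediate frame is a
    double exposure of the two inputs (ghosting)."""
    alphas = np.linspace(0.0, 1.0, num_frames)
    return [(1.0 - a) * img0 + a * img1 for a in alphas]

# Two toy "images": a bright square on the left vs. on the right.
img0 = np.zeros((8, 8)); img0[2:6, 0:3] = 1.0
img1 = np.zeros((8, 8)); img1[2:6, 5:8] = 1.0
frames = cross_dissolve(img0, img1, num_frames=5)
# The middle frame contains both squares at half intensity, rather
# than a single square halfway across the image.
```

Note how the artifact arises directly from the formula: mass never moves across the image, it only fades in and out in place.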
A natural candidate for such a transition comes from the Wasserstein barycenter problem (WBP). The Wasserstein barycenter is the probability distribution that minimizes the mean of its Wasserstein distances to each element in a given set of probability distributions. Considering two input probability distributions $\mu_0, \mu_1$ located on the simplex $\Delta_N$, the WBP is then defined as
$$\mu_\alpha = \arg\min_{\mu \in \Delta_N} \; (1-\alpha)\, W^2(\mu_0, \mu) + \alpha\, W^2(\mu_1, \mu), \qquad (1)$$
where $\alpha \in [0,1]$ and $W$ denotes the Euclidean Wasserstein distance (see Section 3). To obtain a sequence that morphs the distribution $\mu_0$ to $\mu_1$ smoothly, a common approach is to solve Equation (1) for a linear series of $\alpha$ values, equally spaced in $[0,1]$. Indeed, solving the WBP for two input images leads to a smooth (regular) and direct transition while avoiding ghosting artifacts (this usually requires a pre-processing normalization step). That said, the intermediate samples do not necessarily seem “natural”, as can be seen in Figures (a) and (b). To overcome this issue, one may replace the Euclidean metric with the geodesic distance over the manifold of natural images. However, this manifold is typically unknown or very complex, making this approach impractical.
A generative network $G$ maps vectors from a low-dimensional latent space to high-dimensional images. When two images and their matching latent representations are given, i.e. $x_0 = G(z_0)$ and $x_1 = G(z_1)$, a transition is obtained by linearly interpolating the two latent vectors as follows:
$$x_\alpha = G\big((1-\alpha)\, z_0 + \alpha\, z_1\big), \qquad (2)$$
with $\alpha \in [0,1]$. Since each interpolated image is an output of the generative network, each image follows an image prior, leading to a natural-looking transition. However, as we show in this work, these transitions do not necessarily obey the desired properties mentioned earlier. First, the pace of the changes might vary throughout the transformation, as demonstrated in Figures (b) and (c), where most of the transition is concentrated in one or two frames. Second, the change itself might not be direct and minimal. For example, in Figure (c) the colors become too bright before darkening back again.
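The latent-space baseline can be sketched as follows, assuming only some generator callable `G`; the `toy_G` below is an illustrative stand-in for a trained network, not a real model:

```python
import numpy as np

def latent_interpolation(G, z0, z1, num_frames=8):
    """Morph by linearly interpolating latent codes and decoding each.

    Every frame is a genuine sample of the generator, so no ghosting
    occurs; however, nothing controls the pace of change along the path."""
    alphas = np.linspace(0.0, 1.0, num_frames)
    return [G((1.0 - a) * z0 + a * z1) for a in alphas]

# Toy stand-in generator: maps a 2-vector to a 4x4 "image".
def toy_G(z):
    return np.outer(np.ones(4), z[0] * np.arange(4) + z[1])

frames = latent_interpolation(toy_G,
                              np.array([0.0, 1.0]),
                              np.array([1.0, 0.0]),
                              num_frames=5)
```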
The main contribution of this work is a novel algorithm for solving a constrained form of the WBP. Building on it, we introduce a novel approach for image morphing that is based on the Euclidean WBP with additional constraints on the intermediate images. Concretely, we enforce each image in the sequence to reside on the manifold of natural images by using image priors, leading to a transition that fulfills all of the aforementioned requirements. Moreover, we present an approach to measure these three properties numerically and show the advantage of our method.
2 Previous Work
Image morphing has been studied and has evolved for over three decades. Classical methods relied on a simple cross-dissolve operation together with a geometric warp of the two images, using a dense correspondence map, which is typically hard to obtain automatically. In some cases, however, manual correspondence maps were avoided. For example, a method based on optimal transport was suggested to obtain a short-time-domain interpolation, e.g. interpolating two consecutive video frames. A recent work suggested a morphing process in which the intermediate images are generated from patches of the input ones, constraining their similarity. This method does not require correspondence maps even when the input images differ substantially. Later work extended this concept by generalizing the local patch-based constraints to a single global one. To morph one image into the other, the authors suggested traversing the manifold of natural images: they first project the input images onto the latent space of a trained GAN, and then linearly interpolate these latent vectors. This transformation is used to compute a motion and color flow, which is then applied to one of the inputs. Hence, the final transformation is actually a geometric warp of the input image, as opposed to being generated by the model. In this work, we further extend this approach by traversing the latent space in a non-linear manner. This path is obtained by solving the Wasserstein barycenter problem for the input images. Moreover, we discard the use of the flow fields by increasing the resolution of the generated images. Furthermore, our method is not restricted to GANs and can be used with any image prior.
3 The Wasserstein Barycenter Problem
3.1 Symbols and Notations
We define $\Omega$ as the space created by regular samples of $[0,1]^2$, leading to an $n \times n$ grid of pixels, where $N = n^2$. For simplicity, we refer to $\Omega$ as a 1-dimensional vector of size $N$. The symbol $\Delta_N$ denotes the space of probability measures defined on $\Omega$, i.e. if $\mu \in \Delta_N$, then $\mu_i \ge 0$ and $\sum_{i=1}^{N} \mu_i = 1$, where the element $\mu_i$ is mapped to the $i$-th pixel in $\Omega$. Finally, we use $d(i,j)$ to denote the Euclidean distance between pixels $i$ and $j$ in the grid defined by $\Omega$.
3.2 Optimal Transport
Given source and target distributions $\mu, \nu \in \Delta_N$, it is possible to transform one into the other using a transportation plan $T$. This transportation plan describes the amount of mass to be passed from each pixel in $\mu$ to each pixel in $\nu$, while preserving mass-conservation rules. The set containing all possible plans is defined as:
$$\Pi(\mu,\nu) = \left\{ T \in \mathbb{R}_{\ge 0}^{N \times N} \;:\; T\mathbf{1} = \mu,\; T^{\top}\mathbf{1} = \nu \right\}, \qquad (3)$$
where $\mathbf{1}$ is an all-ones vector of size $N$. For a given cost matrix $C \in \mathbb{R}^{N \times N}$, optimal transport is defined as the transportation plan which is the minimizer of:
$$\min_{T \in \Pi(\mu,\nu)} \; \langle T, C \rangle = \min_{T \in \Pi(\mu,\nu)} \sum_{i,j} T_{i,j}\, C_{i,j}. \qquad (4)$$
Specifically, when the matrix $C$ is a distance matrix, the minimum of Eq. (4) is referred to as a Wasserstein distance. For example, when the squared Euclidean distance is used, Eq. (4) is equivalent to:
$$W^2(\mu,\nu) = \min_{T \in \Pi(\mu,\nu)} \sum_{i,j} T_{i,j}\, d(i,j)^2, \qquad (5)$$
and we denote the resulting distance by $W(\mu,\nu)$. Indeed, as the name suggests, the Wasserstein distance is a distance metric.
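The feasibility constraints defining $\Pi(\mu,\nu)$ can be checked in a few lines; the helper name below is ours, for illustration:

```python
import numpy as np

def is_transport_plan(T, mu, nu, tol=1e-9):
    """Check the mass-conservation constraints defining Pi(mu, nu):
    T must be nonnegative, its row sums must equal mu, and its
    column sums must equal nu."""
    return (np.all(T >= -tol)
            and np.allclose(T.sum(axis=1), mu, atol=tol)
            and np.allclose(T.sum(axis=0), nu, atol=tol))

mu = np.array([0.5, 0.5])
nu = np.array([0.25, 0.75])
T = np.outer(mu, nu)  # the independent coupling is always feasible
assert is_transport_plan(T, mu, nu)
```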
To find the minimizer of Eq. (5), one needs to solve a linear program (LP) with $N^2$ variables. For example, an image with tens of thousands of pixels leads to an LP with billions of variables, making it an impractical task. To overcome this issue, we seek to approximate problem (5). A common approximation is the use of an entropic regularization:
$$W_{\varepsilon}(\mu,\nu) = \min_{T \in \Pi(\mu,\nu)} \sum_{i,j} T_{i,j}\, d(i,j)^2 - \varepsilon H(T), \qquad (6)$$
where $H(T) = -\sum_{i,j} T_{i,j} (\log T_{i,j} - 1)$ is the entropy of the plan. This regularization stabilizes the solution by making the problem strictly convex, and the solution can be found efficiently using the Sinkhorn algorithm. Hereinafter, we denote by $W_{\varepsilon}$ the entropic-regularized Wasserstein distance.
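A minimal NumPy sketch of the Sinkhorn iterations for the entropic-regularized problem; the grid, cost, and parameter values below are illustrative:

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.5, n_iter=500):
    """Entropic-regularized optimal transport via Sinkhorn matrix scaling.

    Returns the regularized plan T = diag(u) K diag(v) with
    K = exp(-C / eps); the scalings u, v are updated alternately
    so that the marginals of T match mu and nu."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]

# 1-D toy problem on a 5-point grid with squared-Euclidean cost.
x = np.arange(5, dtype=float)
C = (x[:, None] - x[None, :]) ** 2
mu = np.array([0.6, 0.2, 0.1, 0.05, 0.05])
nu = mu[::-1].copy()
T = sinkhorn(mu, nu, C)
```

Each update enforces one marginal exactly; alternating them converges geometrically to a plan satisfying both.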
3.3 Wasserstein Barycenters
For any given distance metric $d(\cdot,\cdot)$, the barycenter of a set of inputs $\{x_k\}_{k=1}^{K}$ with corresponding weights $\{\lambda_k\}_{k=1}^{K}$, where $\lambda_k \ge 0$ and $\sum_{k} \lambda_k = 1$, is defined as:
$$\bar{x} = \arg\min_{x} \sum_{k=1}^{K} \lambda_k\, d(x, x_k)^p, \qquad (7)$$
where $p \ge 1$. Specifically, the Wasserstein barycenter problem is defined as finding the probability measure that minimizes the sum of $p$-powered Wasserstein distances to a set of probability measures $\{\mu_k\}_{k=1}^{K} \subset \Delta_N$:
$$\bar{\mu} = \arg\min_{\mu \in \Delta_N} \sum_{k=1}^{K} \lambda_k\, W(\mu, \mu_k)^p \qquad (8)$$
$$\;\;\;= \arg\min_{\mu \in \Delta_N} \sum_{k=1}^{K} \lambda_k\, W^2(\mu, \mu_k), \qquad (9)$$
where in (9) we chose $p = 2$. This problem is strictly convex and various efficient solvers have been suggested [6, 7, 27, 1]. Wasserstein barycenters have been used for various applications in image processing and shape analysis, including texture mixing, color transfer [12, 27], and shape interpolation. In the following section, we propose a novel solution for a constrained version of the Wasserstein barycenter problem, and use it to obtain a natural-looking barycenter of images.
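For intuition, the unconstrained entropic barycenter can be approximated with a Sinkhorn-like iterative Bregman projection scheme; this is a sketch with illustrative parameters, not the dual-descent solver used later in the paper:

```python
import numpy as np

def wasserstein_barycenter(mus, weights, C, eps=1.0, n_iter=200):
    """Entropic Wasserstein barycenter via iterative Bregman projections:
    run Sinkhorn-style scalings against each input measure, coupled
    through a shared barycenter b (a weighted geometric mean)."""
    K = np.exp(-C / eps)
    v = [np.ones_like(m) for m in mus]
    b = np.ones_like(mus[0])
    for _ in range(n_iter):
        u = [m / (K @ vi) for m, vi in zip(mus, v)]
        # geometric mean in the log domain enforces a common marginal
        b = np.exp(sum(w * np.log(K.T @ ui) for w, ui in zip(weights, u)))
        v = [b / (K.T @ ui) for ui in u]
    return b

# Barycenter of two point masses at the ends of a 5-point grid:
# the (entropically smoothed) barycenter concentrates at the middle.
x = np.arange(5, dtype=float)
C = (x[:, None] - x[None, :]) ** 2
b = wasserstein_barycenter([np.eye(5)[0], np.eye(5)[4]], [0.5, 0.5], C)
```

With the squared-Euclidean ground cost, the true barycenter of two unit masses at positions 0 and 4 sits at position 2, which the smoothed solution recovers as its mode.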
4 The Proposed Approach
To morph one image into the other, while obtaining natural-looking intermediate images, we suggest restricting the obtained images to satisfy some prior. Formally, we add a constraint to the barycenter problem (Eq. (1)) that limits the result to lie on a manifold $\mathcal{M}$:
$$\min_{\mu \in \mathcal{M}} \; (1-\alpha)\, W^2(\mu_0, \mu) + \alpha\, W^2(\mu_1, \mu). \qquad (10)$$
As this problem might be hard to solve directly, we introduce an auxiliary variable $z$:
$$\min_{\mu,\, z} \; (1-\alpha)\, W^2(\mu_0, \mu) + \alpha\, W^2(\mu_1, \mu) \quad \text{s.t.} \quad \mu = z, \;\; z \in \mathcal{M}. \qquad (11)$$
The augmented Lagrangian of this problem, in its scaled form, is
$$L_{\rho}(\mu, z, u) = (1-\alpha)\, W^2(\mu_0, \mu) + \alpha\, W^2(\mu_1, \mu) + \frac{\rho}{2}\, \|\mu - z + u\|_2^2, \qquad (12)$$
where $\rho > 0$ is the penalty parameter and $u$ is the scaled dual variable. This optimization problem can be solved using the Alternating Direction Method of Multipliers (ADMM), leading to the following steps (see Algorithm 1). First, we find a solution to a regularized version of the WBP; this problem is strictly convex and has been studied previously, and in our work we follow a descent algorithm on the dual problem. The second step is a projection of the previous step's result onto the manifold $\mathcal{M}$. The third and final step is a simple update of the dual variable $u$. These steps are repeated until convergence is achieved. Figure 3 illustrates the differences between our approach and other image morphing approaches, specifically when using a GAN as an image prior.
In cases where the manifold $\mathcal{M}$ is convex, the optimization problem (12) is convex, and convergence to a global minimum is guaranteed. That said, manifolds of interest, such as those of natural images, are often not convex (otherwise a simple linear interpolation between images would suffice), and therefore convergence is guaranteed only to a local minimum. Nevertheless, as we show in Section 6, the obtained results are visually appealing. Note that this approach can be applied with a variety of priors, affecting only the second step of Algorithm 1. In the following subsections we demonstrate our method on the sparse prior and on GANs.
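The three ADMM steps can be sketched generically; the quadratic stand-in for the barycenter objective and the box "manifold" below are purely illustrative, chosen so the constrained minimizer is known in closed form:

```python
import numpy as np

def constrained_barycenter_admm(prox_step, project, n, rho=1.0, n_iter=300):
    """Generic ADMM loop mirroring the three steps of Algorithm 1:
    (1) a proximal solve of the (regularized) barycenter objective,
    (2) a projection onto the manifold, (3) a dual-variable update."""
    mu = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable
    for _ in range(n_iter):
        mu = prox_step(z - u, rho)   # step 1: barycenter solve
        z = project(mu + u)          # step 2: manifold projection
        u = u + mu - z               # step 3: dual update
    return z

# Toy stand-ins: a quadratic objective ||mu - t||^2 and a box "manifold"
# [0, 0.3]; the constrained minimizer is simply t clipped to the box.
t = np.array([0.5, 0.1, -0.2])
prox_step = lambda v, rho: (2.0 * t + rho * v) / (2.0 + rho)
project = lambda v: np.clip(v, 0.0, 0.3)
z_star = constrained_barycenter_admm(prox_step, project, 3)
```

The `prox_step` here solves $\arg\min_\mu \|\mu - t\|^2 + \tfrac{\rho}{2}\|\mu - v\|^2$ in closed form; in the actual algorithm this slot is filled by the regularized WBP solver.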
4.1 Sparse Prior
A well-known prior for various signal-processing tasks is the sparse representation prior [11, 3, 10]. This model assumes that a signal $x$ is constructed as a linear combination of only a few columns, also referred to as atoms, taken from a fixed matrix $D$, known as a dictionary. When a signal is given, projecting it onto the model consists of finding its sparse representation vector $\gamma$:
$$\hat{\gamma} = \arg\min_{\gamma} \|x - D\gamma\|_2^2 \quad \text{s.t.} \quad \|\gamma\|_0 \le s, \qquad (13)$$
for some sparsity level $s$, typically much smaller than the number of atoms. Generally, Eq. (13) is NP-hard, and various approximation algorithms have been suggested to solve this problem, such as the Orthogonal Matching Pursuit (OMP) and the Basis Pursuit (BP) algorithms [19, 4]. Once the representation vector is found, the reconstructed signal is simply $\hat{x} = D\hat{\gamma}$.
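A compact sketch of OMP over a generic dictionary `D` (tested here with a trivial orthonormal dictionary, where exact recovery is guaranteed):

```python
import numpy as np

def omp(D, x, sparsity):
    """Orthogonal Matching Pursuit: greedily select the atom of D most
    correlated with the residual, then re-fit all selected coefficients
    by least squares at every step."""
    residual = x.astype(float).copy()
    support = []
    gamma = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(sparsity):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    gamma[np.array(support)] = coef
    return gamma

# With an orthonormal dictionary, OMP recovers an exactly sparse signal.
x = np.array([0.0, 2.0, 0.0, -1.0, 0.0])
gamma_hat = omp(np.eye(5), x, sparsity=2)
```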
In our approach, constraining the resulting barycenter image to satisfy the sparse representation prior changes the second step in Algorithm 1 to a sparse-coding algorithm, e.g. OMP. In our experiments, we further improve the visual results by approximating the MMSE estimator of $\hat{x}$ using stochastic resonance.
4.2 Generative Adversarial Networks
In the GAN setting [13, 22], a generative network $G$ and a discriminative one contest against each other. Given a dataset, the former is trained to generate samples from it given a random input vector $z$, while the latter is trained to distinguish the generated data from the original. This approach leads to a model that is able to generate new data samples with statistical properties similar to those of the training set, by sampling random vectors from the latent space of the model and passing them through $G$.
In order to use the generative network for image morphing, an inverse mapping, i.e. a mapping from an input image $x$ to its latent representation vector $z$, is required. To obtain this mapping, we follow the approach described in prior work, which we now briefly review. Once the generative network is trained, we train an encoder $E$ such that $G(E(x))$ is similar to the input $x$:
$$\min_{E} \; \|x - G(E(x))\|, \qquad (14)$$
where the similarity is measured using both a pixel-wise distance and features extracted from AlexNet trained on ImageNet. This encoder-decoder scheme may be perceived as a projection of the input signal onto the manifold of natural images. Therefore, to use a GAN as a prior in our approach, the second step in Algorithm 1 is implemented by a simple feedforward pass through the obtained encoder-decoder.
5 Quantifying the Desired Properties
Above, we described three desired attributes for a natural-looking image transformation: (i) to be smooth (regular), i.e. change at a constant pace; (ii) to be as minimal and direct as possible, avoiding unnecessary changes; and (iii) to include natural-looking images. To quantitatively show the advantage of our approach over other alternatives, we propose to measure each of these attributes as follows:
Regularity – to evaluate the smoothness of a transition, we propose to measure the distance between every two consecutive frames, and then compute the standard deviation of these distances over the entire transition. A steady-paced transition results in a very low standard deviation, whereas irregular changes in the transformation correspond to a high variance. Since the Euclidean norm is ill-suited to measuring movements of pixels in the image, we adopt the Wasserstein distance for this task, as it evaluates the minimal effort required to transport each pixel from one frame to the next.
Minimality – a minimal transition consists of a small number of pixel movements during the transformation process. As before, we adopt the Wasserstein distance for this task, as it is a natural metric to quantify these movements. To evaluate the cost of the entire transition, we propose to average the Wasserstein distances between every two successive frames in the transformation.
Natural-looking images – to evaluate the affinity of an image to the class of natural images, we first train an autoencoder on a training set drawn from the chosen dataset. Once this model is trained, we feed each of the images generated in the transformation through the model, and compute its distance to its own projection onto the manifold characterized by the autoencoder.
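The first two metrics can be sketched for 1-D histograms, where the Wasserstein-1 distance has a closed form via CDFs; this is an illustrative simplification of the 2-D image setting:

```python
import numpy as np

def wasserstein_1d(p, q):
    """W1 between two histograms on the same 1-D grid: the L1 distance
    between their cumulative distribution functions."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

def transition_metrics(frames):
    """Regularity = std of consecutive Wasserstein distances (lower is
    steadier); total cost = their mean (lower is more direct)."""
    d = [wasserstein_1d(frames[i], frames[i + 1])
         for i in range(len(frames) - 1)]
    return float(np.std(d)), float(np.mean(d))

# A unit of mass sliding one pixel per frame: a perfectly regular pace,
# so the regularity score (std) is zero and each step costs 1.
frames = [np.eye(5)[i] for i in range(5)]
regularity, total = transition_metrics(frames)
```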
6 Experiments
6.1 MNIST
We first demonstrate our method using the sparse representation model. From our experiments, the generative capabilities of this model seem inferior to those of newer alternatives such as GANs; nevertheless, this example demonstrates the generality of our approach with respect to the chosen prior. We start by learning a dictionary for the training set of the MNIST dataset, using online dictionary learning. Then, we randomly select two test images of the same digit and morph one into the other using Algorithm 1, as described in Section 4.1. For comparison, we show the results of the morphing process using the unconstrained Wasserstein barycenters. As demonstrated in Figure 4, constraining the barycenter outcomes to satisfy the sparse prior yields sharper images that look like real digits, whereas the Wasserstein barycenter approach provides no such guarantee.
We continue our experiments by employing a GAN as an image prior, as described in Section 4.2; specifically, we use the DCGAN architecture. This prior is much more potent and is able to generate digit-like images, even when transforming between different digits. In this case, we experiment with barycenters of 4 input images. To do so, we modify Eq. (1) to include a convex combination of 4 Wasserstein distances, one for each source image. Figure 5 presents a comparison between our approach, the standard Wasserstein barycenters, and a bilinear interpolation of the latent vectors in the GAN setting. In contrast to the Wasserstein barycenter results, both our method and the latent-space bilinear interpolation produce natural digits. However, in the GAN setting the pace is not consistent, leading to false barycenters. For example, in the first row, the image in the center does not look like an “average” of all the others, but is rather similar to the digit “1” inserted at the top-left corner.
6.2 Extended MNIST
The Extended-MNIST dataset contains English characters, which are more complicated than digits, and using the sparse prior leads to unpleasant results. Therefore, to obtain natural-looking images, we focus our experiment on the DCGAN setting. Figures 1 and 6 demonstrate transitions using Wasserstein barycenters, linear interpolation in the latent space, and our method employing the DCGAN as the image prior. As before, it can be observed that the morphs obtained by the latent-space interpolation do not result in a steady-paced transition. In the bottom-right example, the ‘L’ character hardly changes over most of the morphing process, and the bulk of the transformation occurs in the last two steps. Furthermore, in the second example from the top on the right-hand side, the transition from an ‘r’ to a ‘J’ in the latent-interpolation case is not as direct as in our approach. Regarding the Wasserstein barycenters, the outcome images are blurry and often do not look like real English characters.
In addition to the visual results, Table 1 presents the averaged evaluation of all three methods, using the metrics specified in Section 5, on 1500 randomly chosen image pairs. These results show the advantage of our method: it obtains a more direct and steadier-paced transition than latent-space linear interpolation, while remaining close to the desired manifold, in contrast to the Wasserstein barycenters approach.
Table 1: Averaged evaluation of the three methods in terms of Regularity, Total Dist., and Dist. to Manifold.
6.3 Shoe Images – UT Zappos50K
The Zappos50K dataset is much more complicated than the previous two: it contains more detail and higher-resolution images. To train a GAN capable of generating such images, we split the training process in two, somewhat similarly to the training scheme described in StackGAN. First, we downscale the images and train a DCGAN model, as well as an encoder, as described in Section 4.2. The output images of this model are very smooth and lack fine high-frequency details. To add these details, we train an additional generative model. To this end, we generate a dataset of input-output image pairs as follows: the input images are the outputs of the encoder-decoder scheme, upsampled to the target resolution, whereas the output images are the originals downscaled to that same resolution. This dataset is used as a training set for a pix2pix model. To summarize, our projection scheme consists of the following stages: (i) project the input image to the DCGAN's latent space using the trained encoder; (ii) generate a low-frequency image using the DCGAN; (iii) upscale the image to the target resolution; and (iv) feed the image into the pix2pix model to generate high-frequency details.
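The four-stage projection can be sketched as a simple composition; every component name below is an illustrative stand-in for a pre-trained model, not a real API:

```python
import numpy as np

def project_to_manifold(img, encoder, dcgan, upscale, pix2pix):
    """Sketch of the four-stage projection used as step 2 of Algorithm 1
    in the Zappos50K setting; all four callables stand for pre-trained
    models and the names are hypothetical."""
    z = encoder(img)       # (i) encode into the DCGAN latent space
    low = dcgan(z)         # (ii) generate a low-frequency image
    up = upscale(low)      # (iii) upsample to the working resolution
    return pix2pix(up)     # (iv) add high-frequency details

# Identity stand-ins make the composition easy to check.
identity = lambda a: a
img = np.random.rand(4, 4)
out = project_to_manifold(img, identity, identity, identity, identity)
```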
Once the models are trained, we compare the following three methods. The first is the standard Wasserstein barycenter solution (applied to each color channel separately). The second is our proposed algorithm, i.e. projecting each of the Wasserstein barycenters onto the manifold of natural images using the trained generative models. The third is the common GAN-based transition, obtained as follows: each input image is projected onto the latent space using the encoder; then, to obtain a transition, we linearly interpolate the two latent vectors and pass the interpolated vectors through the DCGAN and pix2pix models. From our experiments, iterating Algorithm 1 once produces the best results. The results of our experiments are presented in Figures 2 and 7. Both our method and the GAN alternative produce natural images most of the time. Furthermore, when the two input images are similar in shape and color, the difference between the two approaches is mild (see the last example in Figure 7). However, when the contour or hue of the two input images differs significantly, our approach yields a much steadier and more direct transition in both the shape and the colors of the images.
7 Conclusion
In this work we introduced a novel solution to a constrained variant of the well-known Wasserstein barycenter problem. While our algorithm is general, we propose using it to obtain a natural barycenter (average) of two or more input images, which can then be used to generate a smooth transition from one to the other. For this purpose, we suggest constraining the barycenter to an image prior; specifically, we demonstrate our approach using the sparse prior and generative adversarial networks. We compare our method with the unconstrained variant of the WBP and with linear interpolation of GAN latent vectors, and show the advantage of our method in terms of the smoothness of the transition, the minimal quantity of changes, and the natural look of the acquired images, both visually and numerically. Moreover, we believe our approach of solving the WBP in its constrained form can be used in a variety of applications other than image morphing, e.g. pitch interpolation between two speakers or image style transfer, and we will focus our future work on such extensions.
-  (2016) Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Transactions on Graphics 35 (4), pp. 71–1. Cited by: §3.3.
-  (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3 (1), pp. 1–122. Cited by: §4.
-  (2009) From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM review 51 (1), pp. 34–81. Cited by: §4.1.
-  (1994) Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, Vol. 1, pp. 41–44. Cited by: §4.1.
-  (2017) EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373. Cited by: Figure 1, §6.2.
-  (2014) Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pp. 685–693. Cited by: §1, §3.3.
-  (2016) A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences 9 (1), pp. 320–343. Cited by: §1, §3.3, §4.
-  (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300. Cited by: §3.2.
-  (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.2.
-  (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing 15 (12), pp. 3736–3745. Cited by: §4.1.
-  (2010) Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media. Cited by: §4.1.
-  (2014) Regularized discrete optimal transport. SIAM Journal on Imaging Sciences 7 (3), pp. 1853–1882. Cited by: §3.3.
-  (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1, §4.2.
-  (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134. Cited by: §1, §6.3.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §4.2.
-  (2010) MNIST handwritten digit database. AT&T Labs 2, pp. 18. Cited by: §4.2, §6.1.
-  (2010) Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research 11 (Jan), pp. 19–60. Cited by: §6.1.
-  (1995) Sparse approximate solutions to linear systems. SIAM Journal on Computing 24 (2), pp. 227–234. Cited by: §4.1.
-  (1993) Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar conference on signals, systems and computers, pp. 40–44. Cited by: §4.1.
-  (2019) Computational optimal transport. Foundations and Trends® in Machine Learning 11 (5-6), pp. 355–607. Cited by: §3.
-  (2011) Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446. Cited by: §3.3.
-  (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, Cited by: §1, §4.2, §6.1.
-  (1985) The Wasserstein distance and approximation theorems. Probability Theory and Related Fields 70 (1), pp. 117–129. Cited by: §1.
-  (2010) Regenerative morphing. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 615–622. Cited by: §2.
-  (2019) MMSE approximation for sparse coding algorithms using stochastic resonance. IEEE Transactions on Signal Processing 67 (17), pp. 4597–4610. Cited by: §4.1.
-  (1967) Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly 74 (4), pp. 402–405. Cited by: §3.2.
-  (2015) Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics 34 (4), pp. 66. Cited by: §3.3.
-  (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §3.
-  (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26 (1), pp. 98–117. Cited by: item 1.
-  (1998) Image morphing: a survey. The Visual Computer 14 (8), pp. 360–372. Cited by: §2.
-  (2014) Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 192–199. Cited by: Figure 2, §4.2, §6.3.
-  (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915. Cited by: §6.3.
-  (2016) Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613. Cited by: §1, §2, §4.2.
-  (2007) An image morphing technique based on optimal mass preserving mapping. IEEE Transactions on Image Processing 16 (6), pp. 1481–1495. Cited by: §2.