The goal of artificial intelligence is to understand and analyze the world around us by inferring from acquired data. Given a set of data points, data interpolation or extrapolation aims at predicting novel data points between given samples (interpolation) or predicting novel data outside the sample range (extrapolation). Faithful data interpolation between sampled data points can be seen as a measure of the generalization capacity of a learning system (Berthelot et al., 2018). In the context of computer vision and computer graphics, data interpolation may refer to generating novel views of an object between two given views or predicting in-between animated frames from key frames.
Interpolation that produces novel views of a scene may require the geometric and photometric parameters of the object, the camera parameters and additional scene components, such as lighting and the reflective characteristics of nearby objects. Unfortunately, these characteristics are not always available or are difficult to extract in real-world scenarios. In such cases, it is useful to apply data-driven interpolation, that is, interpolation deduced from a sampled set of instances. In computer graphics, it is common to deal with two types of rendering: model-based rendering, which requires geometric and photometric characteristics, and image-based rendering (IBR), where a scene is represented using a large set of images acquired under various viewing and lighting conditions (Shum et al., 2008). In image-based rendering, the stored images are used to generate novel views, typically by applying geometric and projective models. In this paper, our goal is to generate novel views where the interpolations are generated using machine learning methodologies.
The task of data interpolation aims at extracting a new set of samples (possibly continuous) between known acquired data samples. Clearly, linear interpolation between two images in the input (image) domain does not work as it produces a cross-dissolve effect between the intensities of the two images. Adopting the manifold view of the data samples (Goodfellow et al., 2016; Verma et al., 2018; Bengio et al., 2013), this task can be seen as sampling data points on the geodesic path between the given points. The problem is that this manifold is unknown in advance and one has to approximate it from the given data. Alternatively, adopting the probabilistic perspective, interpolation can be viewed as drawing samples from highly probable areas in the data space.
One of the fascinating properties of unsupervised learning is the ability of the network to reveal the underlying characteristics, determinants or factors of a discrete or continuous dataset. Autoencoders (Doersch, 2016; Kingma and Welling, 2013) represent an effective approach for exposing these factors. Autoencoders have demonstrated the ability to interpolate by decoding a convex sum of latent vectors (Shu et al., 2018). However, this interpolation often incorporates visible artifacts during reconstruction.
To illustrate the problem, consider the following toy example: a scene is composed of a vertical pole at the center of a flat plane. A single light source illuminates the scene and its direction can vary along the upper hemisphere. Hence, the underlying parameters controlling the generated scene are the elevation and the azimuth of the light source. The interactions between the light and the pole produce a cast shadow projected onto the plane with direction and length dictated by the light direction. A set of images of this scene is acquired from a fixed viewing position (from above) with various lighting directions. Our goal in this example is to train a model that is capable of interpolating between two given images using the given set of acquired images. Figure 1, top row, depicts a set of interpolated images where the interpolation is performed in the input domain. As illustrated, the interpolation is not natural as it produces cross-dissolve effects in image intensities. Training a standard autoencoder and applying linear interpolation in its latent space generates images that are much more realistic, as shown in Figure 1, bottom row. Nevertheless, this interpolation is not perfect, as visible artifacts occur during the reconstruction. The source of these artifacts can be investigated by closely inspecting the 2D manifold embedded in the latent space.
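The cross-dissolve effect of input-domain interpolation can be reproduced with a minimal sketch. The one-dimensional "images" below are hypothetical stand-ins for the rendered pole scene (bright background, one dark shadow pixel), and `lerp_images` is an illustrative helper rather than part of the method.

```python
def lerp_images(img_a, img_b, alpha):
    """Pixel-wise linear interpolation: alpha * img_a + (1 - alpha) * img_b."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(img_a, img_b)]


# Hypothetical 1-D "images": background is 1.0, the cast shadow is 0.0.
shadow_left = [1.0, 0.0, 1.0, 1.0, 1.0]   # shadow at index 1
shadow_right = [1.0, 1.0, 1.0, 0.0, 1.0]  # shadow at index 3

# What a faithful in-between image would look like: one shadow at index 2.
faithful_midpoint = [1.0, 1.0, 0.0, 1.0, 1.0]

midpoint = lerp_images(shadow_left, shadow_right, 0.5)
print(midpoint)  # [1.0, 0.5, 1.0, 0.5, 1.0]
```

The blended result contains two half-intensity shadows at the original positions instead of a single shadow at the intermediate position, which is the cross-dissolve artifact described above.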
Figure 2 shows the obtained manifold in the latent space, with data embedded into 3D (upper-left) and 2D (upper-right) latent spaces. The grid lines represent the parameterization. It can be seen that the encoder produces a non-smooth surface in 3D and a highly irregular manifold in 2D. Thus, linear interpolation between two data points may produce in-between points that leave the manifold. In practice, the decoded images of such points are unpredictable and may contain unrealistic artifacts. This is visualized in Figure 3. Reliable results are produced when the interpolation points lie on the manifold (top), but when the interpolation points depart from the manifold, the resulting image is unfaithful and includes unrealistic artifacts (bottom).
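To make the geometric point concrete, the following sketch uses a unit circle as a hypothetical curved latent manifold (an assumption chosen only for illustration; the manifolds in Figure 2 are more irregular). The midpoint of a straight line between two on-manifold points falls strictly inside the circle, that is, off the manifold.

```python
import math


def lerp(p, q, alpha):
    """Linear interpolation between two latent points."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p, q)]


def on_circle(angle):
    """A point on the unit circle, standing in for a curved latent manifold."""
    return [math.cos(angle), math.sin(angle)]


p = on_circle(0.0)
q = on_circle(math.pi / 2)

mid = lerp(p, q, 0.5)
radius = math.hypot(mid[0], mid[1])

# Points on the manifold have radius 1; the linear midpoint does not.
print(radius < 1.0)  # True
```

A decoder trained only on points with radius 1 has never seen inputs like `mid`, which is one way to read the artifacts shown in Figure 3.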
In this paper, we argue that the common statistical view of autoencoders is not appropriate for data generated from continuous factors and one cannot make a precise inference on such data relying only on statistical tools. We suggest that, for better results, the manifold structure of continuous data must be considered, taking into account the geometry of the manifold. If the data is governed by $k$-dimensional continuous vectors, then it can be viewed as a $k$-dimensional manifold embedded in a high-dimensional space (either in latent space or data space). Accordingly, we propose two data interpolation techniques as follows. If the sampled data is available with labels in the form of the underlying generating parameters (self supervised), we propose the convexity loss that penalizes interpolated latent vectors that differ from a convex latent space. Using our convexity loss, the latent space is shaped to be a linear manifold, thus linearly interpolated points are all located on the manifold. When data is available without labels, that is, without the values of the determinant $k$-dimensional vectors (unsupervised), convexity loss cannot be applied. In such unsupervised cases, we propose the adversarial loss, where the interpolation is optimized against a discriminator that learns to differentiate between real and interpolated data points, along with a cycle consistency loss that encourages the latent representation of in-between points to be consistent with their decoded results. This combined loss function encourages the autoencoder to produce bijective mappings that avoid abrupt parameterization jumps in the latent space.
2 Previous Work
Generative models suggest an attractive way of learning data distributions using unsupervised learning. These techniques have become very popular in recent years, demonstrating successful results. In particular, two approaches for generating new data points have become common: the Variational Autoencoder (VAE) (Kingma and Welling, 2013) and the Generative Adversarial Network (GAN) (Goodfellow et al., 2014). GAN aims at generating new samples drawn from the same distribution as a given dataset. One of the appealing properties of GAN is that it can produce reliable results even with small samples of data points (Gurumurthy et al., 2017). Although the GAN approach is highly attractive, it cannot be used directly for data interpolation due to three shortcomings. First, GAN is trained to generate random samples from the learned distribution. Thus, the generator learns a mapping from a random distribution to the desired distribution. To interpolate between two real datapoints, we must map the datapoints back into the latent domain and apply the interpolation in the latent space. Such inverse mapping is not part of the GAN framework, although some attempts have been made (Radford et al., 2015). Second, the latent space of GAN does not necessarily encode a smooth parameterization of the data. There is no guarantee that applying continuous smooth interpolation in the latent space will produce faithful results in the data space. Finally, GAN is known to suffer from mode-collapse phenomena (Srivastava et al., 2017), thus latent space representations of some training images are not necessarily available.
The other approach for generative models is the autoencoder (Doersch, 2016). In its simplest version, the autoencoder is trained to obtain a reduced representation of the input data while removing data redundancies and revealing the determinant characteristics or generating factors. The reduced space can be viewed as an efficient representation space in which data interpolation can be attempted. Formally, the autoencoder is composed of two parts: the encoder, which maps an input $x$ to a latent vector $z$, and the decoder, which maps $z$ to a reconstruction $\hat{x}$ that is expected to be similar to $x$. The latent vector $z$ is a vector whose dimension is typically much lower than that of the input space.
Many improvements to the autoencoder have been proposed over recent years, including new techniques designed for improved convergence and accuracy. These include introducing new regularization terms, new loss objectives (such as adversarial loss) and new network designs (Doersch, 2016; Kingma and Welling, 2013; Larsen et al., 2015; Makhzani et al., 2015; Vincent et al., 2010; Larsen et al., 2016). Other new autoencoder techniques provide frameworks that attempt to shape the latent space to be efficient with respect to factor disentanglement or to make it conducive to latent space interpolation (Kingma and Welling, 2013; Bouchacourt et al., 2017; Makhzani and Frey, 2017; Vincent et al., 2008; Yeh et al., 2016). Within this second category, the VAE was shown to be very successful in applying interpolation in the latent space.
The core idea of the VAE involves replacing the deterministic mapping from the input space to the latent space with a probabilistic mapping. This blurs the resulting manifold in the latent space that corresponds to real data. Additionally, by adding a KL divergence term into the VAE loss, multi-modal distributions, such as MNIST (LeCun and Cortes, 2010), tend to cluster the modes together within the latent space (Dieng et al., 2018). Consequently, linearly interpolating between different modes in the latent space may provide pleasing results that smoothly transition between the modes. Unfortunately, this practice does not apply to data points whose generating factors are continuous (in contrast to multi-modal distributions), as the KL loss term tends to tightly fold the manifold into a compact space, making it squeezed and wiggly. This phenomenon is demonstrated in Figure 2 (lower left image).
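For reference, the KL term mentioned above can be sketched in closed form. This assumes the common VAE setup of a diagonal Gaussian posterior and a standard normal prior, which the text does not spell out; the function name and arguments are illustrative.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Assumes a diagonal Gaussian posterior and a standard normal prior.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


# The penalty is zero only when the posterior equals the prior ...
at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# ... and grows with the squared distance of the mean from the origin.
near_origin = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
far_from_origin = kl_to_standard_normal([3.0, 0.0], [0.0, 0.0])
```

Because the penalty grows with the distance of each latent mean from the origin, it pulls all codes toward a compact region, which is consistent with the folding behaviour described above.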
In order to address data distributions whose generating factors are continuous, Berthelot et al. (2018) propose using a critic network to predict the interpolation parameter $\alpha$ while an autoencoder is trained to fool the critic. The motivation behind this approach is that the interpolation parameter $\alpha$ can be estimated for badly-interpolated images, while it is unpredictable for faithful interpolation. While this approach might work for multi-modal data, it does not seem to work for data sampled from a continuous manifold. In such a case, the artifacts and the unrealistic generated data do not provide any hint about the interpolating factor.
In this paper, we argue that the common statistical view of autoencoders is not appropriate for data generated from continuous factors and that the geometric manifold structure of the latent representation must be considered. More specifically, in order to better deal with continuous data, the geometry of the manifold must be taken into account as points outside of the manifold produce non-realistic images that often include many artifacts and inaccuracies. Accordingly, we propose two approaches for improved data interpolation by shaping the embedded manifold in the latent space.
3 The Proposed Approach
In the following we propose two approaches for data interpolation that address two different use cases. If the underlying parameters are known (self supervised), we propose the convexity loss that penalizes interpolated latent vectors according to the corresponding parameters. When the parameters are unknown (unsupervised), we propose the adversarial loss along with the cycle-consistency loss. With the adversarial loss the autoencoder is optimized to produce "realistic" interpolations, while the cycle-consistency loss encourages the mapping from input to latent space to be bijective. In the following we explain these proposed techniques.
3.1 Supervised Manifold Interpolation
Assume the data is labeled with the associated parameters that generated it. For example, using our synthetic example, if we are given images of a scene under varying single-source illumination directions, then each image is given along with its light direction. Our proposed architecture (see Figure 4) forces the latent manifold to be linear with respect to the given generating parameters, by adding a linearity constraint to the autoencoder loss. Denote by $x_i$ a sample point generated with the parameter vector $\theta_i$. Assume we are given three data points $x_i$, $x_j$ and $x_k$, where $x_j$ is an intermediate point between $x_i$ and $x_k$ with respect to the parameter $\theta$, i.e. $\theta_j = \alpha \theta_i + (1 - \alpha) \theta_k$ where $\alpha \in [0, 1]$. These three images, $x_i$, $x_j$ and $x_k$, are inputs to an autoencoder and are mapped into three points in the latent space, respectively: $z_i$, $z_j$ and $z_k$. In order to force the latent space to be convex, we require that

$$z_j = \alpha z_i + (1 - \alpha) z_k \qquad (1)$$

where $\alpha$ is the same coefficient that relates the three parameter vectors.
We apply this constraint by adding an additional loss to each triplet input that is given to the autoencoder. For each such triplet, the loss (Equation 2) combines two kinds of terms: a reconstruction loss (binary cross-entropy or a norm) between each input image and its reconstructed image after passing through the autoencoder network, and the convexity loss, which penalizes the deviation of the latent vector of the intermediate image from the convex combination of the latent vectors of the two boundary images and thereby encourages the latent manifold to be linear. Finally, we sum the loss over all sampled triplets. Applying the convexity loss, as suggested in Equation 2, to the toy example illustrated above (illuminated pole) results in a flat manifold, as presented in Figure 2 (lower right image). The generated manifold is flattened and, accordingly, the reconstruction of interpolated points is faithful, as will be shown in Section 4. The wrap-around of the parameterization prevents the latent space from forming a linear manifold. To avoid this effect, we "cut" the entire parameter range into several pieces (charts) that jointly cover the entire parameter space, and map each part into a linear manifold. In our case, during training, the dataset is partitioned into two parts, each of which includes a continuous half of the hemisphere along the azimuth. Each part is mapped separately into a linear manifold in the latent space.
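The triplet loss described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the squared Euclidean distance used for the convexity term, the weight `lam`, and all function names are hypothetical, since the text does not fix them.

```python
def convex_combination(z_i, z_k, alpha):
    """alpha * z_i + (1 - alpha) * z_k, computed element-wise."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z_i, z_k)]


def convexity_loss(z_i, z_j, z_k, alpha):
    """Squared Euclidean distance between z_j and the convex combination
    of z_i and z_k (the distance measure is an assumption)."""
    target = convex_combination(z_i, z_k, alpha)
    return sum((a - b) ** 2 for a, b in zip(z_j, target))


def triplet_loss(recon_losses, z_i, z_j, z_k, alpha, lam=1.0):
    """Reconstruction losses of the three images plus the convexity term.
    `lam` is a hypothetical weighting factor."""
    return sum(recon_losses) + lam * convexity_loss(z_i, z_j, z_k, alpha)


# Latent codes that already lie on a straight line incur no penalty.
on_line = convexity_loss([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], alpha=0.5)
# A middle code that is displaced from the line is penalized.
off_line = convexity_loss([0.0, 0.0], [1.0, 2.0], [2.0, 2.0], alpha=0.5)
```

In training, `alpha` would come from the known generating parameters of the triplet, so the penalty vanishes only when the latent codes are arranged linearly with respect to those parameters.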
Additionally, we experiment with an alternative approach for supervised interpolation by substituting the convexity loss in Equation 2 with an image convexity loss. In this case, we encourage the decoding of the interpolated latent vector to be similar to the intermediate image of the triplet. The total loss for a triplet (Equation 3) then combines the reconstruction loss with this image convexity loss in place of the latent convexity loss. This approach enforces latent space linearity by regularizing the outputs of the decoder, but it does not operate directly on the latent space. As will be elaborated in Section 4, this approach did not provide any benefit in terms of the faithfulness of latent space interpolation compared to the regular autoencoder or the variational autoencoder.
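The image-space variant can be sketched in the same spirit. The mean squared error, the toy decoder, and the function names are assumptions made only for illustration.

```python
def mse(u, v):
    """Mean squared error between two equally sized vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def image_convexity_loss(decoder, z_i, z_k, x_j, alpha):
    """Compare the decoding of the interpolated latent vector with the
    intermediate image x_j (MSE is an assumed choice of distance)."""
    z_hat = [alpha * a + (1.0 - alpha) * b for a, b in zip(z_i, z_k)]
    return mse(decoder(z_hat), x_j)


def toy_decoder(z):
    """Hypothetical decoder used only to exercise the function."""
    return [2.0 * v for v in z]


matching = image_convexity_loss(toy_decoder, [0.0], [2.0], [2.0], alpha=0.5)
mismatched = image_convexity_loss(toy_decoder, [0.0], [2.0], [3.0], alpha=0.5)
```

Note that the penalty here is computed after decoding, so it constrains the decoder output rather than the position of the encoded intermediate image in the latent space.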
3.2 Unsupervised Manifold Interpolation
In cases where the data is given without order or labels, convexity loss cannot be applied. In such cases we present adversarial interpolation, where we train a discriminator to differentiate between real and interpolated data. For pairs of input data points $x_i, x_j$ with latent vectors $z_i, z_j$, we linearly interpolate between them in the latent space: $\hat{z} = \alpha z_i + (1 - \alpha) z_j$ where $\alpha \in [0, 1]$, and we would like the decoded image $\hat{x}$ of $\hat{z}$ to look real and fool the discriminator. Additionally, we add a cycle consistency loss and encourage the latent representation of $\hat{x}$ to be $\hat{z}$ again. Putting everything together, the loss for each pair combines three terms: a standard reconstruction loss, a cycle-consistency loss that encourages the latent space to produce meaningful vectors, and the standard discriminator loss, which encourages the network to fool the discriminator so that interpolated images are indistinguishable from the data points. Finally, we sum the loss over all sampled pairs.
Figure 5 illustrates the theoretical motivation for the introduction of the three losses. As seen in the left plot of the figure, the images that lie on the data manifold in the image space (solid black curve) are mapped back reliably due to the reconstruction loss, which directly penalizes the network if it fails to reconstruct the images from the latent vector. However, this loss does not directly affect the mapping of in-between points in the image space into points in the latent space, as visualized by the red arrows representing the encoder. Additionally, linearly interpolated points in the latent space (blue dashed line) are under no constraint that maps them back into the image manifold and thereby produces authentic-looking images (blue arrows in the plot, representing the decoder). Incorporating adversarial loss on the reconstruction of interpolated latent vectors will improve the ability of the decoder to map latent vectors back into the image manifold (blue arrows). However, as visualized in the middle plot in Figure 5, the encoder (red arrows) might map in-between images to latent vectors that are distant from the interpolation line in the latent space. Finally, adding the cycle consistency loss (right plot) forces the encoder-decoder architecture to map interpolated latent vectors to authentic-looking images while those reconstructions themselves are mapped back to the original points in the latent space. These two loss functions together promote bijective mapping (one-to-one and onto) while providing realistic reconstruction of interpolated latent vectors.
The proposed architecture is visualized in Figure 6. At each iteration, we sample two images $x_i, x_j$ from our dataset. The two images are encoded by the shared-weight encoder into $z_i, z_j$, respectively. We sample $\alpha$ uniformly and pass $z_i$, $z_j$ and $\alpha$ to a non-learned layer that calculates the linear interpolation in the latent space, namely $\hat{z} = \alpha z_i + (1 - \alpha) z_j$. We then decode $z_i, z_j$ and calculate the reconstruction loss. We then decode $\hat{z}$ into $\hat{x}$ and alternately provide the discriminator with samples either from the real images or from the interpolated images $\hat{x}$. Finally, we pass $\hat{x}$ through the encoder to obtain its latent representation for the cycle consistency loss and add this loss.
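The steps of one iteration can be sketched as below. The toy encoder, decoder and discriminator are hypothetical stand-ins for the learned networks, the mean squared error and the `-log D(x)` adversarial term are assumed choices, and the separate update of the discriminator itself is omitted.

```python
import math
import random


def lerp(z_a, z_b, alpha):
    """Non-learned interpolation layer: alpha * z_a + (1 - alpha) * z_b."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z_a, z_b)]


def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def interpolation_losses(x_i, x_j, encoder, decoder, discriminator, alpha):
    """Return the (reconstruction, adversarial, cycle) terms for one pair."""
    z_i, z_j = encoder(x_i), encoder(x_j)          # shared-weight encoder
    z_hat = lerp(z_i, z_j, alpha)                  # interpolate in latent space
    recon = mse(x_i, decoder(z_i)) + mse(x_j, decoder(z_j))
    x_hat = decoder(z_hat)                         # decoded in-between image
    adversarial = -math.log(discriminator(x_hat))  # assumed generator-side form
    cycle = mse(z_hat, encoder(x_hat))             # re-encode and compare
    return recon, adversarial, cycle


# Hypothetical stand-ins: an exactly invertible encoder/decoder pair and an
# undecided discriminator that outputs probability 0.5 for every image.
def toy_encoder(x):
    return [0.5 * v for v in x]


def toy_decoder(z):
    return [2.0 * v for v in z]


def toy_discriminator(x):
    return 0.5


alpha = random.random()  # uniform sample in [0, 1)
recon, adv, cyc = interpolation_losses(
    [1.0, 2.0], [3.0, 4.0], toy_encoder, toy_decoder, toy_discriminator, alpha
)
```

With an exactly invertible encoder and decoder, the reconstruction and cycle terms vanish and only the adversarial term remains, which illustrates how the three terms address different failure modes.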
The chosen encoder architecture was VGG-inspired (Simonyan and Zisserman, 2014). We extract the features using convolutional blocks that start from 16 feature maps and gradually increase the number of feature maps to 128 at the last convolutional block. We then flatten the extracted features and pass them through fully connected layers until we reach our desired latent dimensionality. The decoder architecture was symmetrical to that of the encoder. We use max-pooling after each convolutional block and batch normalization with ReLU activations after each learned layer. For the COIL-100 dataset, we use a random 80-20 training-testing split for each experiment. The hyper-parameters were set to the values that produced the best results during hyper-parameter optimization. All experiments were performed using a single NVIDIA V100 GPU.
Evaluating the faithfulness of interpolation is often elusive. In the supervised case, where the exact parameterization and labels are known, for each interpolated image we can retrieve the corresponding ground truth image and use it for evaluation. However, defining a path between two images in the parameter space depends on the parameterization of the underlying factors governing the data, which is unknown (in our toy example, the parameters are the elevation and the azimuth of the light source). Since such a parameterization is not unique, there are multiple viable paths in the latent space that correspond to faithful interpolation between two given datapoints. In the case where the latent manifold exhibits a certain parameterization in the latent space, it can be adopted in the parameter space to evaluate the fidelity of interpolated images.
We used our synthetic illuminated pole dataset in order to be able to visualize the resulting manifold in low dimensions. The images were rendered using the Unity game engine, where all images are taken from a fixed viewing position (from above). A single illumination source rotates at intervals of 5 degrees along the azimuth at different altitudes, ranging from 45 to 80 degrees with respect to the plane in 5-degree intervals. This dataset contains a total of 576 images. In order to test our methods against real images with complex geometric and photometric parameterization, we used the COIL-100 dataset (Nene et al., 1996) that contains color images of 100 objects. The objects were placed on a motorized turntable against a black background. The images were taken at pose intervals of 5 degrees, resulting in a total of 72 images for each class.
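The image counts above follow directly from the sampling grid. The sketch below enumerates it, assuming the azimuth samples start at 0 degrees and cover the full circle (the starting offset is not stated in the text); the variable names are illustrative.

```python
# Azimuth sampled every 5 degrees around the full circle (assumed to start at 0).
azimuths = list(range(0, 360, 5))
# Altitude sampled every 5 degrees from 45 to 80 degrees inclusive.
altitudes = list(range(45, 81, 5))

# One rendered image per (altitude, azimuth) light direction.
light_directions = [(alt, az) for alt in altitudes for az in azimuths]

print(len(azimuths), len(altitudes), len(light_directions))  # 72 8 576
```

The same 5-degree step over a full turn gives the 72 poses per COIL-100 object.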
4.1.1 Supervised Interpolation Evaluation
For the supervised case, we tested our approach against the synthetic dataset. We train an autoencoder using the latent convexity loss where the dimensionality of the latent space was 2D or 3D. Figure 2 (lower right image) presents the resulting manifold in 3D. It can be seen that the 2D manifold is flat and linear, satisfying the convexity constraints.
Given an interval of fixed length in the parameter space, defined by the illumination parameters at its two ends, we iterate over all such intervals and evaluate the reconstruction of the latent space interpolation between the two boundary images, i.e., the images with the illumination parameters at the two ends of the interval. In our evaluation, the interpolation consists of 37 points taken along the vector connecting the latent vectors of the two boundary images, and the reconstruction error was obtained for each such point. The parameterization was linear in the parameter space.
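The evaluation loop for a single interval might be sketched as follows. The even spacing of the 37 points (endpoints included), the mean squared error, and the identity decoder are assumptions for illustration; `t` denotes the fraction of the way from the start code to the end code.

```python
def interpolation_fractions(num_points=37):
    """Evenly spaced fractions in [0, 1], endpoints included (assumed spacing)."""
    return [t / (num_points - 1) for t in range(num_points)]


def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def interpolation_errors(decode, z_start, z_end, ground_truth):
    """Per-point error between decoded latent interpolants and ground truth.

    ground_truth[n] is the reference image for the n-th interpolation point.
    """
    fractions = interpolation_fractions(len(ground_truth))
    errors = []
    for t, target in zip(fractions, ground_truth):
        z = [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]
        errors.append(mse(decode(z), target))
    return errors


# Toy check with a hypothetical identity decoder and a perfectly linear
# ground-truth sequence, for which every interpolation point is exact.
fractions = interpolation_fractions()
toy_truth = [[float(n)] for n in range(37)]
toy_errors = interpolation_errors(lambda z: z, [0.0], [36.0], toy_truth)
```

Plotting such per-point errors against the interpolation fraction gives a curve of the kind reported in Figure 7.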
We compare the following network architectures: regular autoencoder, variational autoencoder and our proposed method using supervised manifold interpolation via latent convexity loss as described in Section 3.1. We also test the resulting images while applying the image convexity loss (Equation 3). Figure 7 shows the MSE of the interpolated images with respect to the ground truth. It is demonstrated that while convexity loss in the image space does not yield a significant improvement in terms of interpolation faithfulness, latent space convexity loss considerably lowers the reconstruction error. The latent space convexity loss provides the lowest MSE in comparison to all other methods, and it does not harm the reconstruction of the boundary images. It is also demonstrated that data interpolation in 2D latent space outperforms interpolation in 3D latent space. This outcome stems from the real intrinsic dimensionality of the manifold. This number, however, is not necessarily known in real-world scenarios.
Figure 8 demonstrates qualitative evaluation for our proposed architecture on our synthetic dataset. Each of the blocks represents interpolation between two fixed points. The first and second rows of each block correspond to a naive autoencoder and a variational autoencoder, respectively. Note that the leftmost and rightmost images, which correspond to the two test set images at the ends of the interpolation, exhibit realistic reconstruction, while in-between reconstructions are unfaithful and contain unrealistic artifacts. The third and fourth rows correspond to the supervised and unsupervised methods. It is shown that their image reconstructions appear realistic and exhibit no artifacts. Note that the parameterization differs between the two methods; however, this does not affect the faithfulness of interpolation.
4.1.2 Unsupervised Interpolation Evaluation
For the unsupervised case, we analyze our results using qualitative and quantitative measures. The bottom row of each block in Figure 8 shows the interpolation results on the synthetic dataset using our proposed method for the unsupervised case. It can be seen that we obtain realistic-looking reconstructions, comparable to the results we obtained for the supervised case. Note, however, that it is impossible to define a "correct" geodesic path between two points since the parameterization of the dataset is redundant. Thus, interpolation results of our supervised and unsupervised methods are realistic and without artifacts, yet produce different parameterizations in latent space. In both cases, the autoencoders were trained using a three-dimensional latent space. Figure 10 visualizes the resulting latent manifold for the unsupervised case. The learned manifold is mostly flat, although not as smooth as in the supervised case. Note that the different parameterization obtained in the unsupervised case does not allow us to accurately quantify the reconstruction error.
We tested the unsupervised interpolation method on real images from the COIL-100 dataset. Since those images exhibit complex geometric and reflectance variations, for the sake of faithful reconstruction, the dimensionality of the latent space had to be increased to 256. Figure 9 compares the resulting images of latent interpolation on the COIL-100 dataset using three methods: a naive autoencoder, a variational autoencoder and our proposed unsupervised method. It is demonstrated that when the interpolated images approach the middle of the range between the two boundary images, the reconstruction capability of the standard autoencoder and the VAE gets significantly worse and unrealistic artifacts are produced. Our method considerably reduces those artifacts. Figure 11 demonstrates interpolation results using all methods on a COIL-100 object.
For a quantitative comparison, we fix an interval length, which is a multiple of 5 degrees, and calculate the reconstruction error against the available ground-truth images. In this experiment, we use an interval length of 80 degrees and interpolate 14 in-between images. The reconstruction error of interpolated images is presented in Figure 13. It is demonstrated that our method reduces both the mean squared error and the standard deviation of the MSE for different $\alpha$ values. We then inspect the average reconstruction error on a range of intervals from 25 degrees to 70 degrees, as presented in the bottom part of Figure 13. Note that our proposed method is able to consistently reduce the reconstruction error of interpolated images even when the interval length increases.
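The mean and standard deviation of the error per interpolation point can be aggregated across intervals as in the sketch below. The use of the population standard deviation and the function name are assumptions, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def aggregate_errors(per_interval_errors):
    """Mean and standard deviation of the error at each interpolation point.

    per_interval_errors[n][t] is the MSE of interpolation point t in
    interval n. The population standard deviation is an assumed choice.
    """
    per_point = zip(*per_interval_errors)  # regroup errors by interpolation point
    return [(mean(column), pstdev(column)) for column in per_point]


# Two hypothetical intervals with two interpolation points each.
summary = aggregate_errors([[0.0, 2.0], [2.0, 4.0]])
```

Each entry of `summary` pairs the average error at one interpolation point with its spread across intervals.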
4.2 Conclusion & Discussion
Approaches to the problem of realistic and faithful interpolation in the latent spaces of generative models have achieved tremendous success in the last few years. We argue that generative approaches that deal with manifold data are not as common as those that deal with multi-modal data, and that this misinterpretation of manifold data harms the competence of generative models to deal with them successfully. In this work, we argue that the manifold structures of data generated from continuous factors should be taken into account. Our main contribution is generalizing our supervised approach to the unsupervised case by applying convexity regularization using adversarial and cycle consistency losses. Using this technique, we manage to drastically improve the fidelity of interpolated images using a small dataset of images taken from various viewing directions. In future work, we intend to further investigate latent spaces that exhibit non-uniform parameterization, as visualized in Figure 12, by implementing a directional derivative loss along the interpolation vector. In addition, we intend to develop a well-defined metric for performance evaluation of generative models capable of faithful interpolation.
References
- Bengio et al. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
- Berthelot et al. (2018). Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543.
- Bouchacourt et al. (2017). Multi-level variational autoencoder: learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841.
- Dieng et al. (2018). Avoiding latent variable collapse with generative skip models. CoRR abs/1807.04863.
- Doersch (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
- Goodfellow et al. (2016). Deep learning. Vol. 1, MIT Press, Cambridge.
- Goodfellow et al. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- Gurumurthy et al. (2017). DeLiGAN: generative adversarial networks for diverse and limited data. In CVPR, pp. 4941–4949.
- Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Larsen et al. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
- Larsen et al. (2016). Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1558–1566.
- LeCun and Cortes (2010). MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
- Makhzani and Frey (2017). PixelGAN autoencoders. In Advances in Neural Information Processing Systems, pp. 1975–1985.
- Makhzani et al. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
- Nene et al. (1996). Object image library (COIL-100). Technical report.
- Radford et al. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434.
- Shu et al. (2018). Deforming autoencoders: unsupervised disentangling of shape and appearance. In The European Conference on Computer Vision (ECCV).
- Shum et al. (2008). Image-based rendering. Springer Science & Business Media.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
- Srivastava et al. (2017). VEEGAN: reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3308–3318.
- Verma et al. (2018). Manifold mixup: encouraging meaningful on-manifold interpolation as a regularizer. arXiv preprint arXiv:1806.05236.
- Vincent et al. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103.
- Vincent et al. (2010). Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (Dec), pp. 3371–3408.
- Yeh et al. (2016). Semantic facial expression editing using autoencoded flow. arXiv preprint arXiv:1611.09961.