Faithful Autoencoder Interpolation by Shaping the Latent Space

by Alon Oring, et al.

One of the fascinating properties of deep learning is the ability of the network to reveal the underlying factors characterizing elements in datasets of different types. Autoencoders represent an effective approach for computing these factors. Autoencoders have been studied in the context of their ability to interpolate between data points by decoding mixed latent vectors. However, this interpolation often incorporates disrupting artifacts or produces unrealistic images during reconstruction. We argue that these incongruities are due to the manifold structure of the latent space where interpolated latent vectors deviate from the data manifold. In this paper, we propose a regularization technique that shapes the latent space following the manifold assumption while enforcing the manifold to be smooth and convex. This regularization enables faithful interpolation between data points and can be used as a general regularization as well for avoiding overfitting and constraining the model complexity.




1 Introduction

The goal of artificial intelligence is to understand and analyze the world around us by inferring from acquired data. Given a set of data points, data interpolation or extrapolation aims at predicting novel data points between given samples (interpolation) or predicting novel data outside the sample range (extrapolation). Faithful data interpolation between sampled data points can be seen as a measure of the generalization capacity of a learning system (Berthelot et al., 2018). In the context of computer vision and computer graphics, data interpolation may refer to generating novel views of an object between two given views or predicting in-between animated frames from key frames.

Figure 1: Top row: cross-dissolve artifacts can be seen as a result of linear interpolation in the input space. Bottom row: the image reconstruction obtained by linear latent space interpolation of an autoencoder. Unrealistic artifacts are clearly introduced.

Interpolation that produces novel views of a scene may require the geometric and photometric parameters of the object, the camera parameters and additional scene components, such as lighting and the reflective characteristics of nearby objects. Unfortunately, these characteristics are not always available or are difficult to extract in real-world scenarios. In such cases it is useful to apply data-driven interpolation, that is, interpolation that is deduced from a sampled set of instances. In computer graphics, it is common to deal with two types of rendering: model-based rendering, which requires geometric and photometric characteristics, and image-based rendering (IBR), where a scene is represented using a large set of images acquired under various viewing and lighting conditions (Shum et al., 2008). In image-based rendering, the stored images are used to generate novel views, typically by applying geometric and projective models. In this paper, our goal is to generate novel views where the interpolations are generated using machine learning methodologies.

The task of data interpolation aims at extracting a new set of samples (possibly continuous) between known acquired data samples. Clearly, linear interpolation between two images in the input (image) domain does not work, as it produces a cross-dissolve effect between the intensities of the two images. Adopting the manifold view of the data samples (Goodfellow et al., 2016; Verma et al., 2018; Bengio et al., 2013), this task can be seen as sampling data points on the geodesic path between the given points. The problem is that this manifold is unknown in advance and one has to approximate it from the given data. Alternatively, adopting the probabilistic perspective, interpolation can be viewed as drawing samples from highly probable areas in the data space.
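The cross-dissolve effect can be illustrated with a minimal sketch. The one-row "images" and the helper names `lerp_images` and `bar_image` are hypothetical and chosen only for illustration; they are not part of the paper's pipeline.

```python
def lerp_images(img_a, img_b, alpha):
    """Pixel-wise linear interpolation: (1 - alpha) * img_a + alpha * img_b."""
    return [
        [(1.0 - alpha) * a + alpha * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def bar_image(width, position):
    """Hypothetical one-row image with a single bright pixel at `position`."""
    return [[1.0 if col == position else 0.0 for col in range(width)]]


left = bar_image(7, 1)    # bright pixel at column 1
right = bar_image(7, 5)   # bright pixel at column 5
middle = lerp_images(left, right, 0.5)

# A faithful in-between image would contain one bright pixel at column 3.
# The pixel-space blend instead keeps both pixels at half intensity.
print(middle[0])  # [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0]
```

The blended result contains two faint copies of the object rather than one object at an intermediate position, which is the artifact described above.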

One of the fascinating properties of unsupervised learning is the ability of the network to reveal the underlying characteristics, determinants or factors of a discrete or continuous dataset. Autoencoders (Doersch, 2016; Kingma and Welling, 2013) represent an effective approach for exposing these factors. Autoencoders have demonstrated the ability to interpolate by decoding a convex sum of latent vectors (Shu et al., 2018). However, this interpolation often incorporates visible artifacts during reconstruction.

To illustrate the problem, consider the following toy example: a scene is composed of a vertical pole at the center of a flat plane. A single light source illuminates the scene and its direction can vary along the upper hemisphere. Hence, the underlying parameters controlling the generated scene are $(\theta, \phi)$, the elevation and the azimuth, respectively. The interactions between the light and the pole produce a cast shadow projected onto the plane with direction and length dictated by the light direction. A set of images of this scene is acquired from a fixed viewing position (from above) with various lighting directions. Our goal in this example is to train a model that is capable of interpolating between two given images using the given set of acquired images. Figure 1, top row, depicts a set of interpolated images where the interpolation is performed in the input domain. As illustrated, the interpolation is not natural as it produces cross-dissolve effects in image intensities. Training a standard autoencoder and applying linear interpolation in its latent space generates images that are much more realistic, as shown in Figure 1, bottom row. Nevertheless, this interpolation is not perfect as visible artifacts occur during the reconstruction. The source of these artifacts can be investigated by closely inspecting the 2D manifold embedded in the latent space.

Figure 2: Upper images: the latent space manifold of the synthetic data embedded in 3D (left) and 2D (right) learned by an autoencoder. Grid lines represent the parameterization. Lower images: the 3D latent space learned by the Variational Autoencoder (left) and our proposed supervised method (right).

Figure 2 shows the obtained manifold in the latent space, with data embedded into 3D (upper-left) and 2D (upper-right) latent spaces. The grid lines represent the parameterization. It can be seen that the encoder produces a non-smooth surface in 3D and a highly irregular manifold in 2D. Thus, linear interpolation between two data points may produce in-between points that leave the manifold. In practice, the decoded images of such points are unpredictable and may contain non-realistic artifacts. This is visualized in Figure 3. Reliable results are produced when the interpolation points lie on the manifold (top), but when the interpolation points depart from the manifold, the resulting image is unfaithful and includes unrealistic artifacts (bottom).

Figure 3: When the interpolated point is on the manifold surface (yellow circle), a faithful image is generated by the decoder (top). When the interpolated point departs from the manifold, the resulting image is unpredictable (bottom).

In this paper, we argue that the common statistical view of autoencoders is not appropriate for data generated from continuous factors and one cannot make a precise inference on such data relying only on statistical tools. We suggest that, for better results, the manifold structure of continuous data must be considered, taking into account the geometry of the manifold. If the data is governed by $k$-dimensional continuous vectors then it can be viewed as a $k$-dimensional manifold embedded in high dimensional space (either in latent space or data space). Accordingly, we propose two data interpolation techniques as follows. If the sampled data is available with labels in the form of the underlying generating parameters (self supervised), we propose the convexity loss that penalizes interpolated latent vectors that differ from a convex latent space. Using our convexity loss, the latent space is shaped to be a linear manifold, thus linearly interpolated points are all located on the manifold. When data is available without labels, that is, without the values of the determinant $k$-dimensional vectors (unsupervised), convexity loss cannot be applied. In such unsupervised cases, we propose the adversarial loss where the interpolation is optimized against a discriminator that learns to differentiate between real and interpolated data points, along with a cycle consistency loss that encourages the latent representation of in-between points to be consistent with their decoded results. This combined loss function encourages the autoencoder to produce bijective mappings that avoid abrupt parameterization jumps in the latent space.

2 Previous Work

Generative models suggest an attractive way of learning data distributions using unsupervised learning. These techniques have become very popular in recent years, demonstrating successful results. In particular, two approaches for generating new data points have become common: the Variational Autoencoder (VAE) (Kingma and Welling, 2013) and the Generative Adversarial Network (GAN) (Goodfellow et al., 2014). GAN aims at generating new samples drawn from the same distribution of a given dataset. One of the appealing properties of GAN is that it can produce reliable results even with small samples of data points (Gurumurthy et al., 2017). Although the GAN approach is highly attractive, it cannot be used directly for data interpolation due to three shortcomings. First, GAN is trained to generate random samples from the learned distribution. Thus, the generator learns a mapping from a random distribution to the desired distribution. To interpolate between two real datapoints, we must map the datapoints back into the latent domain and apply the interpolation in the latent space. Such inverse mapping is not part of the GAN framework although some attempts have been made (Radford et al., 2015). Second, the latent space of GAN does not necessarily encode a smooth parameterization of the data. There is no guarantee that applying continuous smooth interpolation in the latent space will produce faithful results in the data space. Finally, GAN is known to suffer from mode-collapse phenomena (Srivastava et al., 2017), thus latent space representations of some training images are not necessarily available.

The other approach for generative models is the autoencoder (Doersch, 2016). In its simplest version, the autoencoder is trained to obtain a reduced representation of the input data while removing data redundancies and revealing the determinant characteristics or generating factors. The reduced space can be viewed as an efficient representation space in which data interpolation can be attempted. Formally, the autoencoder is composed of two parts: the encoder $z = f(x)$ and the decoder $\hat{x} = g(z)$, where $\hat{x}$ is expected to be similar to $x$. The latent vector $z$ is a vector whose dimension is typically much lower than that of the input space.
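As a minimal sketch of latent-space interpolation with an encoder and a decoder, the helper below decodes evenly spaced convex combinations of two latent codes. The function names and the toy encoder and decoder are hypothetical stand-ins for trained networks, and the weighting convention `(1 - alpha) * z_a + alpha * z_b` is an assumption made for illustration.

```python
def interpolate_latent(encode, decode, x_a, x_b, num_steps):
    """Decode `num_steps` (>= 2) evenly spaced mixes of two latent codes."""
    z_a, z_b = encode(x_a), encode(x_b)
    frames = []
    for step in range(num_steps):
        alpha = step / (num_steps - 1)  # runs from 0.0 to 1.0 inclusive
        z_mix = [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]
        frames.append(decode(z_mix))
    return frames


# Toy stand-ins (assumptions): a 1-D latent code holding the mean intensity.
def toy_encode(x):
    return [sum(x) / len(x)]


def toy_decode(z):
    return [z[0]] * 4


frames = interpolate_latent(toy_encode, toy_decode, [0.0] * 4, [1.0] * 4, 5)
print(frames[2])  # [0.5, 0.5, 0.5, 0.5]
```

With a real autoencoder, nothing forces the mixed code to stay on the learned manifold, which is the issue the rest of the paper addresses.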

There are many improvements for the autoencoder that have been proposed over recent years, including new techniques designed for improved convergence and accuracy. These include introducing new regularization terms, new loss objectives (such as adversarial loss) and new network designs (Doersch, 2016; Kingma and Welling, 2013; Larsen et al., 2015; Makhzani et al., 2015; Vincent et al., 2010; Larsen et al., 2016). Other new autoencoder techniques provide frameworks that attempt to shape the latent space to be efficient with respect to factor disentanglement or to make it conducive to latent space interpolation (Kingma and Welling, 2013; Bouchacourt et al., 2017; Makhzani and Frey, 2017; Vincent et al., 2008; Yeh et al., 2016). Within this second category, the VAE was shown to be very successful in applying interpolation in the latent space.

The core idea of the VAE involves replacing the deterministic mapping from the input space to the latent space with a probabilistic mapping. This fact blurs out the resulting manifold in the latent space that corresponds to real data. Additionally, by adding a KL divergence term into the VAE loss, multi-modal distributions, such as MNIST (LeCun and Cortes, 2010), tend to cluster the modes together within the latent space (Dieng et al., 2018). Consequently, linearly interpolating between different modes in the latent space may provide pleasing results that smoothly transition between the modes. Unfortunately, this practice does not apply for data points whose generating factors are continuous (in contrast to multi-modal distributions) as the KL loss term tends to tightly fold the manifold into a compact space making it squeezed and wiggly. This phenomenon is demonstrated in Figure 2 (lower left image).

In order to address data distributions whose generating factors are continuous, Berthelot et al. (2018) propose using a critic network to predict the interpolation parameter $\alpha$ while an autoencoder is trained to fool the critic. The motivation behind this approach is that the interpolation parameter $\alpha$ can be estimated for badly-interpolated images, while it is unpredictable for faithful interpolation. While this approach might work for multi-modal data, it does not seem to work for data sampled from a continuous manifold. In such a case, the artifacts and the unrealistic generated data do not provide any hint about the interpolating factor.

In this paper, we argue that the common statistical view of autoencoders is not appropriate for data generated from continuous factors and that the geometric manifold structure of the latent representation must be considered. More specifically, in order to better deal with continuous data, the geometry of the manifold must be taken into account as points outside of the manifold produce non-realistic images that often include many artifacts and inaccuracies. Accordingly, we propose two approaches for improved data interpolation by shaping the embedded manifold in the latent space.

3 The Proposed Approach

In the following we propose two approaches for data interpolation that address two different use cases. If the underlying parameters are known (self supervised), we propose the convexity loss that penalizes interpolated latent vectors according to the corresponding parameters. When the parameters are unknown (unsupervised), we propose the adversarial loss along with the cycle-consistency loss. With the adversarial loss the autoencoder is optimized to produce "realistic" interpolations, while the cycle-consistency loss encourages the mapping from input to latent space to be bijective. In the following we explain these proposed techniques.

3.1 Supervised Manifold Interpolation

Figure 4: Shaping the manifold with convexity loss. The weights of the encoder and the decoder are shared across the network. The dotted lines between the inputs and the reconstructions represent the reconstruction loss ($\mathcal{L}_R$) and the dotted lines between the latent vectors represent our convexity loss ($\mathcal{L}_C$).

Assume the data is labeled with the associated parameters that generated it. For example, using our synthetic example, if we are given images of a scene under varying single source illumination directions, then each image is given along with its light direction $(\theta, \phi)$ (elevation and azimuth). Our proposed architecture (see Figure 4) forces the latent manifold to be linear with respect to the given generating parameters, by adding a linearity constraint to the autoencoder loss. Denote by $x(p)$ a sample point generated with the parameter vector $p$. Assume we are given three data points $x_i = x(p_i)$, $x_j = x(p_j)$ and $x_k = x(p_k)$, where $x_j$ is an intermediate point between $x_i$ and $x_k$ with respect to the parameter $p$, i.e. $p_j = (1-\alpha)\,p_i + \alpha\,p_k$ where $\alpha \in [0, 1]$. These three images, $x_i$, $x_j$ and $x_k$, are inputs to an autoencoder and are mapped into three points in the latent space, respectively: $z_i$, $z_j$ and $z_k$. In order to force the latent space to be convex, we require that $z_j = \hat{z}_j$, where

$$\hat{z}_j = (1-\alpha)\,z_i + \alpha\,z_k \tag{1}$$

We apply this constraint by adding an additional loss to each triplet input that is given to the autoencoder. For each such triplet we minimize:

$$\mathcal{L}_{ijk} = \sum_{m \in \{i,j,k\}} \mathcal{L}_R(x_m, \hat{x}_m) + \mathcal{L}_C(z_j, \hat{z}_j) \tag{2}$$

where $\hat{x}_m$ is the reconstructed image after passing through the autoencoder network, $\mathcal{L}_R$ is a reconstruction loss (binary cross-entropy or a norm) and $\mathcal{L}_C$ is the convexity loss encouraging the latent manifold to be linear. Finally, we sum the loss over all sampled triplets. Applying the convexity loss, as suggested in Equation 2, to the toy example illustrated above (illuminated pole) results in a flat manifold as presented in Figure 2 (lower right image). The generated manifold is flattened and accordingly, the reconstruction of interpolated points is faithfully generated as will be shown in Section 4. The wrap-around of the parameterization prevents the latent space from forming a linear manifold. To avoid this effect, we "cut" the entire parameter range into several pieces (charts) that jointly cover the entire parameter space, and map each part into a linear manifold. In our case, during training, the dataset is partitioned into two parts, each of which includes a continuous half of the hemisphere along the azimuth. Each part is mapped separately into a linear manifold in the latent space.
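The triplet objective described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the squared Euclidean distance stands in for both the reconstruction and the convexity terms, the terms are summed without weights, and the names `squared_l2`, `convexity_loss` and `triplet_loss` are hypothetical.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def convexity_loss(z_i, z_j, z_k, alpha):
    """Penalize z_j for deviating from the mix (1 - alpha) * z_i + alpha * z_k."""
    z_hat = [(1.0 - alpha) * a + alpha * c for a, c in zip(z_i, z_k)]
    return squared_l2(z_j, z_hat)


def triplet_loss(images, reconstructions, latents, alpha):
    """Reconstruction terms for the three images plus the convexity term."""
    recon = sum(squared_l2(x, x_hat) for x, x_hat in zip(images, reconstructions))
    z_i, z_j, z_k = latents
    return recon + convexity_loss(z_i, z_j, z_k, alpha)


# The middle code lies exactly on the segment between the end codes: no penalty.
on_segment = convexity_loss([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], 0.5)
# The middle code is displaced by one unit along the second axis.
off_segment = convexity_loss([0.0, 0.0], [1.0, 2.0], [2.0, 2.0], 0.5)
print(on_segment, off_segment)  # 0.0 1.0
```

Here `alpha` is derived from the known generating parameters of the triplet, which is why this loss is only available in the supervised setting.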

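Additionally, we experiment with an alternative approach for supervised interpolation by substituting the convexity loss in Equation 2 with an image convexity loss. In this case, we encourage the decoding of the interpolated latent vector, $\tilde{x}_j = g\big((1-\alpha)\,z_i + \alpha\,z_k\big)$ with $g$ the decoder and $z_i$, $z_k$ the latent codes of the two end-point images, to be similar to the intermediate image $x_j$. The total loss for a triplet then reads:

$$\mathcal{L}_{ijk} = \sum_{m \in \{i,j,k\}} \mathcal{L}_R(x_m, \hat{x}_m) + \mathcal{L}_R(x_j, \tilde{x}_j) \tag{3}$$

This approach enforces latent space linearity by regularizing the outputs of the decoder, but it does not operate directly on the latent space. As will be elaborated in Section 4, this approach did not provide any benefit in terms of the faithfulness of latent space interpolation compared to the regular autoencoder or the variational autoencoder.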
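For comparison with the latent convexity loss, the image-space variant can be sketched as below. The mean squared error and the toy decoder are assumptions made for illustration, and `image_convexity_loss` is a hypothetical name.

```python
def image_convexity_loss(decode, z_i, z_k, x_j, alpha):
    """Compare the decoded latent mix with the true intermediate image x_j."""
    z_hat = [(1.0 - alpha) * a + alpha * c for a, c in zip(z_i, z_k)]
    x_tilde = decode(z_hat)
    return sum((p - q) ** 2 for p, q in zip(x_tilde, x_j)) / len(x_j)


# Toy decoder (assumption): maps a 1-D code to a 2-pixel image.
def toy_decode(z):
    return [z[0], 2.0 * z[0]]


matching = image_convexity_loss(toy_decode, [0.0], [2.0], [1.0, 2.0], 0.5)
mismatch = image_convexity_loss(toy_decode, [0.0], [2.0], [1.0, 4.0], 0.5)
print(matching, mismatch)  # 0.0 2.0
```

Note that this term never looks at the encoder's code for `x_j`, so the encoder can still place the intermediate image far from the segment between the end codes.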

Figure 5: Left: using only the reconstruction loss guarantees faithful reconstruction of images that appear in the dataset. Middle: adding adversarial loss improves the reconstruction quality of in-between latent codes, but does not affect the encoder mapping since in-between images can be mapped to latent vectors far from the straight line in the latent space. Right: cycle consistency loss forces the mapping of interpolated images to map back to the same latent vector, thus directly affecting the shape of the manifold and promoting bijective mapping.

3.2 Unsupervised Manifold Interpolation

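In cases where the data is given without order or labels, convexity loss cannot be applied. In such cases we present adversarial interpolation where we train a discriminator to differentiate between real and interpolated data. For pairs of input data-points $x_i, x_j$ with latent codes $z_i = f(x_i)$ and $z_j = f(x_j)$ produced by the encoder $f$, we linearly interpolate between them in the latent space: $z_{ij} = (1-\alpha)\,z_i + \alpha\,z_j$ where $\alpha \in [0, 1]$, and we would like the decoded image $\hat{x}_{ij} = g(z_{ij})$, with $g$ the decoder, to look real and fool the discriminator $D$. Additionally, we add a cycle consistency loss and encourage the latent representation of $\hat{x}_{ij}$ to be $z_{ij}$ again, namely $f(\hat{x}_{ij}) \approx z_{ij}$. Putting everything together we obtain:

$$\mathcal{L}_{ij} = \mathcal{L}_R(x_i, \hat{x}_i) + \mathcal{L}_R(x_j, \hat{x}_j) + \mathcal{L}_{cyc}\big(z_{ij}, f(\hat{x}_{ij})\big) + \mathcal{L}_{adv}(\hat{x}_{ij}) \tag{4}$$

where $\mathcal{L}_R$ is a standard reconstruction loss, $\mathcal{L}_{cyc}$ is a cycle-consistency loss that encourages the latent space to produce meaningful vectors, and $\mathcal{L}_{adv}$ is the standard discriminator loss which encourages the network to fool the discriminator so that interpolated images are indistinguishable from the data points. Finally, we sum the loss over all sampled pairs.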

Figure 5 illustrates the theoretical motivation for the introduction of the three losses. As seen on the left plot in the figure, the images that lie on the data manifold in the image space (solid black curve) are mapped back reliably due to the reconstruction loss that directly penalizes the network if it fails to reconstruct the images from the latent vector. However, this loss does not directly affect the mapping of in-between points in the image space into points in the latent space, as visualized by the red arrows representing the encoder. Additionally, linearly interpolated points in the latent space (blue dashed line) are under no constraint that maps them back onto the image manifold, so nothing guarantees that they decode to authentic-looking images (blue arrows in the plot, representing the decoder). Incorporating adversarial loss on the reconstruction of interpolated latent vectors will improve the ability of the decoder to map latent vectors back into the image manifold (blue arrows). However, as visualized in the middle plot in Figure 5, the encoder (red arrows) might map in-between images to latent vectors that are far from the straight line in the latent space. Finally, adding the cycle consistency loss (right plot) forces the encoder-decoder architecture to map interpolated latent vectors to authentic-looking images while those reconstructions themselves are mapped back to the original points in the latent space. These two loss functions together promote bijective mapping (one-to-one and onto) while providing realistic reconstruction of interpolated latent vectors.

Figure 6: Our proposed architecture. Dotted lines represent loss functions. The interpolation block is a non-learned layer that performs linear interpolation in latent space. The weights of the encoder and the decoder are shared.

The proposed architecture is visualized in Figure 6. At each iteration, we sample two images $x_i, x_j$ from our dataset. The two images are encoded by the shared-weight encoder into $z_i, z_j$, respectively. We sample $\alpha$ uniformly from $[0, 1]$ and pass $z_i$, $z_j$ and $\alpha$ to the interpolation block, a non-learned layer that calculates the linear interpolation in the latent space, namely $z_{ij} = (1-\alpha)\,z_i + \alpha\,z_j$. We then decode $z_i, z_j$ and calculate the reconstruction loss. We then decode $z_{ij}$ and alternately provide the discriminator with samples either from the real images or from the decoded interpolations. Finally, we pass the decoded interpolation through the encoder to obtain its latent code for the cycle consistency loss and add this loss.
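One forward pass of this unsupervised objective can be sketched as follows. This is a hedged illustration rather than the authors' code: the mean squared error for the reconstruction and cycle terms, the `-log D(x)` form of the adversarial term, and the toy `encode`, `decode` and `discriminate` callables are all assumptions, and `unsupervised_losses` is a hypothetical name.

```python
import math
import random


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def unsupervised_losses(encode, decode, discriminate, x_i, x_j, alpha):
    """Return the three loss terms for one pair of images."""
    z_i, z_j = encode(x_i), encode(x_j)
    # Non-learned interpolation layer: linear mix of the two latent codes.
    z_mix = [(1.0 - alpha) * a + alpha * b for a, b in zip(z_i, z_j)]
    x_mix = decode(z_mix)
    reconstruction = mse(x_i, decode(z_i)) + mse(x_j, decode(z_j))
    # Cycle consistency: re-encoding the decoded mix should return z_mix.
    cycle = mse(encode(x_mix), z_mix)
    # Assumed generator-side adversarial term; `discriminate` returns the
    # probability in (0, 1) that its input is a real image.
    adversarial = -math.log(discriminate(x_mix))
    return {"reconstruction": reconstruction, "cycle": cycle,
            "adversarial": adversarial}


# Toy stand-ins (assumptions) for trained networks.
def toy_encode(x):
    return [x[0]]


def toy_decode(z):
    return [z[0], 0.0]


def toy_discriminate(x):
    return 0.5  # an undecided discriminator


alpha = random.random()  # sampled uniformly, as in the text
losses = unsupervised_losses(toy_encode, toy_decode, toy_discriminate,
                             [0.0, 0.0], [2.0, 0.0], alpha)
print(losses)
```

In a real training loop the discriminator would be updated in alternation on real images and on decoded interpolations, while the autoencoder minimizes the sum of the three terms.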

3.3 Architecture

The chosen encoder architecture was VGG-inspired (Simonyan and Zisserman, 2014). We extract the features using convolutional blocks that start from 16 feature maps and gradually increase the number of feature maps to 128 at the last convolutional block. We then flatten the extracted features and pass them through fully connected layers until we reach our desired latent dimensionality. The decoder architecture was symmetrical to that of the encoder. We use max-pooling after each convolutional block and batch normalization with ReLU activations after each learned layer. For the COIL-100 dataset, we use a random 80-20 training-testing split for each experiment. During hyper-parameter optimization, we found that

produces the best results. All experiments were performed using a single NVIDIA V100 GPU.

4 Results

Evaluating the faithfulness of interpolation is often elusive. In the supervised case, where the exact parameterization and labels are known, for each interpolated image we can retrieve the corresponding ground-truth image and use it for evaluation. However, defining a path between two images in the parameter space depends on the parameterization of the underlying factors governing the data, which is in general unknown (the parameters in our toy example are the elevation and the azimuth of the light source). Since such a parameterization is not unique, there are multiple viable paths in the latent space that correspond to faithful interpolation between the datapoints. In the case where the latent manifold exhibits a certain parameterization in the latent space, it can be adopted in the parameter space to evaluate the fidelity of interpolated images.

4.1 Dataset

We used our synthetic illuminated pole dataset in order to be able to visualize the resulting manifold in low dimensions. The images were rendered using the Unity game engine, where all images are taken from a fixed viewing position (from above). A single illumination source rotates at intervals of 5 degrees along the azimuth at different altitudes, ranging from 45 to 80 degrees with respect to the plane in 5-degree intervals. This dataset contains a total of 576 images. In order to test our methods against real images with complex geometric and photometric parameterization, we used the COIL-100 dataset (Nene et al., 1996) that contains color images of 100 objects. The objects were placed on a motorized turntable against a black background. The images were taken at pose intervals of 5 degrees, resulting in a total of 72 images for each class.
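The image counts follow directly from the sampling grid, as the short check below shows. It assumes the azimuth covers the full 360 degrees starting at 0, which the text does not state explicitly but which is consistent with the reported total.

```python
# Assumed grid: full-circle azimuth in 5-degree steps, altitudes 45..80 degrees.
azimuths = list(range(0, 360, 5))
altitudes = list(range(45, 81, 5))
light_directions = [(alt, az) for alt in altitudes for az in azimuths]

print(len(azimuths), len(altitudes), len(light_directions))  # 72 8 576
```

The same 5-degree step over a full turn gives the 72 poses per object in COIL-100.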

4.1.1 Supervised Interpolation Evaluation

For the supervised case, we tested our approach against the synthetic dataset. We train an autoencoder using the latent convexity loss where the dimensionality of the latent space was 2D or 3D. Figure 2 (lower right image) presents the resulting manifold in 3D. It can be seen that the 2D manifold is flat and linear, satisfying the convexity constraints.

Given an interval of some fixed length in the parameter space, we iterate over all such intervals and compare the reconstruction of the latent space interpolation between the latent codes of the two end-point images with the images whose illumination parameters lie along the interval. In our evaluation, the interpolation consists of 37 points taken along the vector between the two end-point latent codes, and the reconstruction error was obtained for each such point. The parameterization was linear in the parameter space.

We compare the following network architectures: regular autoencoder, variational autoencoder and our proposed method using supervised manifold interpolation via latent convexity loss as described in Section 3.1. We also test the resulting images while applying the image convexity loss (Equation 3). Figure 7 shows the MSE of the interpolated images with respect to the ground truth. It is demonstrated that while convexity loss in the image space does not yield a significant improvement in terms of interpolation faithfulness, latent space convexity loss considerably lowers the reconstruction error. The latent space convexity loss provides the lowest MSE in comparison to all other methods, and it does not harm the reconstruction of the boundary images. It is also demonstrated that data interpolation in 2D latent space outperforms interpolation in 3D latent space. This outcome stems from the real intrinsic dimensionality of the manifold. This number, however, is not necessarily known in real world scenarios.
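The evaluation protocol can be sketched as a function that returns one MSE value per interpolation step. It assumes the ground-truth images are ordered along a path that is linear in the parameter space, with the two end-point images first and last; `interpolation_mse_curve` and the identity encoder and decoder are hypothetical and used only for illustration.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def interpolation_mse_curve(encode, decode, ground_truth):
    """MSE of each decoded latent mix against the matching ground-truth image."""
    z_start, z_end = encode(ground_truth[0]), encode(ground_truth[-1])
    last = len(ground_truth) - 1
    errors = []
    for idx, target in enumerate(ground_truth):
        alpha = idx / last
        z_mix = [(1.0 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]
        errors.append(mse(decode(z_mix), target))
    return errors


def identity(v):
    return list(v)


# Toy 1-pixel "images": the true middle image (1.0) is not the linear mix (2.0).
curve = interpolation_mse_curve(identity, identity, [[0.0], [1.0], [4.0]])
print(curve)  # [0.0, 1.0, 0.0]
```

The toy curve has the shape reported in the paper: zero error at the end points and a larger error in between when the latent path does not follow the data.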

Figure 7: Mean squared error as a function of the interpolation variable on the synthetic dataset. Reconstruction performance on end-point images are similar for all methods. Latent space convexity loss outperforms all other methods.

Figure 8 presents a qualitative evaluation of our proposed architecture on our synthetic dataset. Each of the blocks represents interpolation between two fixed end points. The first and second rows of each block correspond to a naive autoencoder and a variational autoencoder, respectively. Note that the leftmost and rightmost images, which correspond to the two test set end-point images, exhibit realistic reconstruction, while in-between reconstructions are unfaithful and contain unrealistic artifacts. The third and fourth rows correspond to the supervised and unsupervised methods. It is shown that image reconstructions appear realistic and exhibit no artifacts. Note that the parameterization in the two methods is different; however, this does not affect the faithfulness of interpolation.

Figure 8: In each block, we perform linear interpolation between the left-most and the right-most images. The first row in each block corresponds to a naive autoencoder (AE), the second row to a variational autoencoder (VAE), and our proposed supervised and unsupervised methods are used in the third and fourth rows (P_S and P_U, respectively).

Figure 9: Linear latent space interpolation on COIL-100 dataset. At each row, the images at the leftmost and rightmost sides are the two input images. The interpolations between them are presented with a constant latent vector interval. For every object, we present the resulting linear interpolations of the naive autoencoder (AE), variational autoencoder (VAE) and our proposed unsupervised method (P_U).

4.1.2 Unsupervised Interpolation Evaluation

For the unsupervised case, we analyze our results using qualitative and quantitative measures. The bottom row of each block in Figure 8 shows the interpolation results on the synthetic dataset using our proposed method for the unsupervised case. It can be seen that we obtain realistic-looking reconstructions, comparable to the results we obtained for the supervised case. Note, however, that it is impossible to define a "correct" geodesic path between two points since the parameterization of the dataset is redundant. Thus, interpolation results of our supervised and unsupervised methods are realistic and without artifacts, yet produce different parameterizations in latent space. In both cases, the autoencoders were trained using a three-dimensional latent space. Figure 10 visualizes the resulting latent manifold for the unsupervised case. The learned manifold is mostly flat, but not as smooth as in the supervised case. Note that the different parameterization obtained in the unsupervised case does not allow us to accurately quantify the reconstruction error.

Figure 10: The unsupervised latent space manifold of our architecture in 3D. While not as symmetric and smooth as the supervised manifold, we obtained a close to 2D manifold embedded in 3D space that demonstrates faithful interpolation.

We tested the unsupervised interpolation method on real images from the COIL-100 dataset. Since those images exhibit complex geometric and reflectance variations, for the sake of faithful reconstruction, the dimensionality of the latent space had to be increased to 256. Figure 9 compares the resulting images of latent interpolation on the COIL-100 dataset using three methods: a naive autoencoder, a variational autoencoder and our proposed unsupervised method. It is demonstrated that when the interpolated images approach the middle of the range between the two end-point images, the reconstruction capability of the standard autoencoder and the VAE gets significantly worse and unrealistic artifacts are produced. Our method considerably reduces those artifacts. Figure 11 demonstrates interpolation results using all methods on a COIL-100 object.

Figure 11: Linear latent space interpolation on COIL-100 dataset. At each row, the images at the leftmost and rightmost sides are the two input images. We present the resulting interpolations of the naive autoencoder (AE), variational autoencoder (VAE) and our proposed supervised (P_S) and unsupervised (P_U) methods.

For a quantitative comparison, we fix an interval length, which is a multiple of 5 degrees, and calculate the reconstruction error against the available ground-truth images. In this experiment, we use an interval length of 80 degrees and interpolate 14 in-between images. The reconstruction error of the interpolated images is presented in Figure 13. Our method reduces both the mean squared error (MSE) and the standard deviation of the MSE across the different alpha values. We then inspect the average reconstruction error over a range of interval lengths from 25 degrees to 70 degrees, as presented in the bottom part of Figure 13. Note that our proposed method consistently reduces the reconstruction error of interpolated images even as the interval length increases.
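The evaluation protocol above can be sketched as follows. This is an illustration under stated assumptions, not our evaluation code: `interpolation_mse` and the `interpolate_fn` hook are hypothetical names, the sketch places one interpolated image at every ground-truth view strictly inside the interval, it ignores wrap-around past the last view, and the pixel-space blend is only a toy stand-in for a model that interpolates in latent space.

```python
import numpy as np

def interpolation_mse(views, interpolate_fn, interval_deg, step_deg=5):
    """Per-alpha mean and std of the MSE between interpolated and ground-truth views.

    views: array (n_views, ...) of images ordered by angle, step_deg apart.
    interpolate_fn(x_a, x_b, alpha) -> image; hypothetical hook for any model.
    """
    if interval_deg % step_deg != 0:
        raise ValueError("interval_deg must be a multiple of step_deg")
    n_steps = interval_deg // step_deg
    alphas = np.arange(1, n_steps) / n_steps  # positions strictly between the endpoints
    errors = []
    # Only intervals that fit inside the sequence are used (no wrap-around).
    for start in range(len(views) - n_steps):
        x_a, x_b = views[start], views[start + n_steps]
        row = []
        for k, alpha in enumerate(alphas, start=1):
            pred = interpolate_fn(x_a, x_b, alpha)
            row.append(np.mean((pred - views[start + k]) ** 2))
        errors.append(row)
    errors = np.asarray(errors)
    return alphas, errors.mean(axis=0), errors.std(axis=0)

def pixel_blend(x_a, x_b, alpha):
    """Toy interpolator (assumption): a linear blend in pixel space."""
    return (1.0 - alpha) * x_a + alpha * x_b

# Toy data (assumption): views that vary linearly with the angle index,
# so the pixel blend reproduces the ground truth up to rounding error.
rng = np.random.default_rng(0)
base = rng.random((8, 8))
toy_views = np.stack([k * base for k in range(12)])
alphas, mse_mean, mse_std = interpolation_mse(toy_views, pixel_blend, interval_deg=20)
```

Repeating the call for several values of `interval_deg` and averaging `mse_mean` over alpha gives a curve of error against interval length, analogous to the bottom graph of Figure 13.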

Figure 12: Examples of a non-uniform parameterization in latent space. The interpolation changes abruptly without smoothly transitioning from one mode to another. Note that our unsupervised technique interpolates without artifacts on the top row while the VAE and AE exhibit unrealistic artifacts during the transition.

Figure 13: We use the parameterization of the dataset to evaluate the reconstruction accuracy of the regular autoencoder, the VAE, and our proposed method. Upper graphs: MSE vs. alpha values for the three methods. Note that our method reduces both the reconstruction error and the standard deviation of the reconstruction error. Bottom graph: averaged MSE of the interpolated images vs. the interval length. The accuracy decreases as the interval length increases, but the proposed approach is less affected by large intervals.

4.2 Conclusion & Discussion

The problem of realistic and faithful interpolation in the latent spaces of generative models has seen tremendous progress in the last few years. We argue that generative approaches that deal with manifold data are not as common as those that deal with multi-modal data, and that this misinterpretation of manifold data harms the ability of generative models to handle it successfully. In this work, we argue that the manifold structure of data generated from continuous factors should be taken into account. Our main contribution is generalizing our supervised approach to the unsupervised case by applying convexity regularization using adversarial and cycle-consistency losses. Using this technique, we drastically improve the fidelity of interpolated images using a small dataset of images taken from various viewing directions. In future work, we intend to further investigate latent spaces that exhibit non-uniform parameterization, as visualized in Figure 12, by implementing a directional derivative loss along the interpolation vector. In addition, we intend to develop a well-defined metric for evaluating the performance of generative models capable of faithful interpolation.


  • Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §1.
  • D. Berthelot, C. Raffel, A. Roy, and I. Goodfellow (2018) Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543. Cited by: §1, §2.
  • D. Bouchacourt, R. Tomioka, and S. Nowozin (2017) Multi-level variational autoencoder: learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841. Cited by: §2.
  • A. B. Dieng, Y. Kim, A. M. Rush, and D. M. Blei (2018) Avoiding latent variable collapse with generative skip models. CoRR abs/1807.04863. Cited by: §2.
  • C. Doersch (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. Cited by: §1, §2, §2.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. Vol. 1, MIT press Cambridge. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • S. Gurumurthy, R. K. Sarvadevabhatla, and R. V. Babu (2017) DeLiGAN: generative adversarial networks for diverse and limited data. In CVPR, pp. 4941–4949. External Links: ISBN 978-1-5386-0457-1 Cited by: §2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2, §2.
  • A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300. Cited by: §2.
  • A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1558–1566. Cited by: §2.
  • Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Cited by: §2.
  • A. Makhzani and B. J. Frey (2017) PixelGAN autoencoders. In Advances in Neural Information Processing Systems, pp. 1975–1985. Cited by: §2.
  • A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: §2.
  • S. A. Nene, S. K. Nayar, and H. Murase (1996) Object image library (COIL-100). Technical report. Cited by: §4.1.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434. Cited by: §2.
  • Z. Shu, M. Sahasrabudhe, R. Alp Guler, D. Samaras, N. Paragios, and I. Kokkinos (2018) Deforming autoencoders: unsupervised disentangling of shape and appearance. In The European Conference on Computer Vision (ECCV), Cited by: §1.
  • H. Shum, S. Chan, and S. B. Kang (2008) Image-based rendering. Springer Science & Business Media. Cited by: §1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §3.3.
  • A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton (2017) VEEGAN: reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3308–3318. Cited by: §2.
  • V. Verma, A. Lamb, C. Beckham, A. Courville, I. Mitliagkis, and Y. Bengio (2018) Manifold mixup: encouraging meaningful on-manifold interpolation as a regularizer. arXiv preprint arXiv:1806.05236. Cited by: §1.
  • P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. Cited by: §2.
  • P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (Dec), pp. 3371–3408. Cited by: §2.
  • R. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala (2016) Semantic facial expression editing using autoencoded flow. arXiv preprint arXiv:1611.09961. Cited by: §2.