1 Introduction
Generative modeling has the potential to become an important tool for exploring the parallels between perceptual, physical, and physiological representations in fields such as psychology, linguistics, and neuroscience (e.g. Sainburg et al. 2018; Thielk et al. 2018; Zuidema et al. 2018). The ability to infer abstract, low-dimensional representations of data and to sample from these distributions allows one to quantitatively explore and vary complex stimuli in ways that typically require hand-designed feature tuning, for example varying the formant frequencies of vowel phonemes or the fundamental frequency of birdsong syllables.
Several classes of unsupervised neural networks, such as the Autoencoder (AE; Hinton & Salakhutdinov 2006; Kingma & Welling 2013), the Generative Adversarial Network (GAN; Goodfellow et al. 2014), autoregressive models (Hochreiter & Schmidhuber, 1997; Graves, 2013; Oord et al., 2016; Van Den Oord et al., 2016), and flow-based generative models (Kingma & Dhariwal, 2018; Dinh et al., 2014; Kingma et al., 2016; Dinh et al., 2016), are currently popular for learning latent representations that can be used to generate novel data samples. Unsupervised neural network approaches are well suited to data generation and exploration because they do not rely on hand-engineered features and thus can be applied to many different types of data. However, unsupervised neural networks often lack constraints that can be useful or important for psychophysical experimentation, such as preserving pairwise relationships between data in neural network projections, or ensuring that morphs between stimuli fit into the true data distribution.
We propose a novel AE that hybridizes features of an AE and a GAN. Our network is trained explicitly to control the structure of latent representations, and it promotes convexity in latent space by adversarially constraining interpolations between data samples in latent space to produce realistic samples. (Pairwise interpolations between latent samples may cover only a subset of the convex hull of the latent distribution, as described in Figure 1.)
1.1 Background on Generative Adversarial Networks (GANs) and Autoencoders (AEs)
An AE is a form of neural network which takes an input $x$ (e.g. an image) and is trained to generate a reproduction of that input, $G(x)$, by minimizing some error function between the input and output (e.g. pixelwise error). (We denote $G(x)$ as being equivalent to $G_d(G_e(x))$, or $x$ being passed through the encoder and decoder of the generator, $G$.) This translation is usually performed after compressing the input into a low-dimensional representation $z$. This low-dimensional representation is called a latent representation, and the layer corresponding to $z$ in the neural network is often called the latent layer. The first half of the network, which translates from $X$ to $Z$, is called the encoder; the second half of the network, which translates from $Z$ to $X$, is called the decoder. The combination of these two networks makes the AE capable of both dimensionality reduction ($X \rightarrow Z$; Hinton & Salakhutdinov 2006) and generativity ($Z \rightarrow X$). Importantly, AE architectures are generative in the dictionary sense of having the power or function of generating, originating, producing, or reproducing (Webster, 2018); however, they are not generative models (Bishop, 2006), because they do not model the joint probability of the observable and target variables, $P(X, Z)$. Variants such as the Variational Autoencoder (VAE; Kingma & Welling 2013), which do model $P(X, Z)$, are generative models. AE latent spaces therefore cannot be sampled probabilistically without modeling the joint probability, as in VAEs. The AE architecture that we propose here does not model the joint probability of $X$ and $Z$ and thus is not a generative model, although the latent space of our network could be modeled probabilistically (e.g. with a VAE).
GAN architectures are composed of two networks, a generator and a discriminator. The generator takes as input a latent sample, $z$, drawn randomly from a distribution (e.g. uniform or normal), and is trained to produce a sample $x_{\mathrm{gen}}$ in the data domain $X$. The discriminator takes as input both real samples, $x$, and generated samples, $x_{\mathrm{gen}}$, and is trained to differentiate between them, typically by outputting either a 0 or 1 in a single-neuron output layer. The generator is trained to oppose the discriminator by 'tricking' it into categorizing generated samples as real samples. Intuitively, this results in the generator producing samples indistinguishable (at least to the discriminator) from those drawn from the distribution of $X$. Thus the discriminator acts as a 'critic' of the samples produced by a generator that is attempting to reproduce the distribution of $X$. Because GANs sample directly from a predefined latent distribution, GANs are generative models, explicitly representing the joint probability, $P(X, Z)$.
One common use for both GANs and AEs has been exploiting the semantically rich low-dimensional manifold, $Z$, onto which data are projected or from which they are sampled (White, 2016; Hinton & Salakhutdinov, 2006). Operations performed in $Z$ carry rich semantic features of data, and interpolations between points in $Z$ produce semantically smooth interpolations in the original data space (e.g. Radford et al. 2015; White 2016). However, samples generated from the latent representations of both AEs and GANs are limited by the constraints provided by the algorithm. A significant amount of work has been done over the past several years in developing variants of AEs and GANs which add constraints and functionality to these architectures, for example improving the stability of GANs (e.g. Berthelot et al. 2017; Radford et al. 2015; Salimans et al. 2016), disentangling latent representations (e.g. Higgins et al. 2017; Chen et al. 2016; Bouchacourt et al. 2017), adding generative capacity to AEs (e.g. Kingma & Welling 2013; Kingma et al. 2016; Makhzani et al. 2015), and adding bidirectional inference to GANs (e.g. Larsen et al. 2015; Mescheder et al. 2017; Berthelot et al. 2017; Dumoulin et al. 2016; Ulyanov et al. 2017; Makhzani 2018).
In this work, we describe several limitations of GANs and AEs, specifically as they relate to stimulus generation for psychophysical research, and propose a novel architecture, GAIA, that combines aspects of both the AE and the GAN to offset the shortcomings of each architecture on its own. Our method provides a novel approach to increasing the stability of network training, increasing the convexity of latent space representations, preserving high-dimensional structure in latent space representations, and providing bidirectionality, from $X$ to $Z$ and from $Z$ to $X$.
1.2 Convexity of latent space
Generative latent spaces enable smooth interpolations between real-world signals in a high-dimensional space. Linear interpolations in a low-dimensional latent space often produce comprehensible representations when projected back into high-dimensional space (e.g. Engel et al. 2017; Dosovitskiy et al. 2015). In the latent spaces of many network architectures such as AEs, however, linear interpolations are not necessarily justified, because the space between the endpoints of an interpolation in $Z$ is not explicitly trained to fall within the data distribution when translated back into $X$.
A convex set of points is defined as a set in which the line segment connecting any pair of points falls entirely within the set (Klee, 1971). For example, in Figure 1A, the purple distribution represents data projected into a two-dimensional latent space, and the surrounding whitespace represents regions of latent space that do not correspond to the data distribution. This distribution is nonconvex because a line connecting two points in the distribution (e.g. the black points in Figure 1A) can contain points outside the data distribution (the red region). In an AE, if the red region of the interpolation were sampled, projections back into the high-dimensional space would not necessarily correspond to realistic exemplars of $X$, because that region of $Z$ does not belong to the true data distribution.
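The definition above can be checked numerically on toy sets. The sketch below is illustrative only and is not taken from the paper: the ring and disc shapes, their radii, and the helper names `in_ring`, `in_disc`, and `fraction_inside` are all assumptions chosen to show a segment leaving a nonconvex set while staying inside a convex one.

```python
import numpy as np


def in_ring(points, r_inner=0.75, r_outer=1.25):
    """Boolean mask: which 2D points lie inside a ring (a nonconvex set)."""
    radius = np.linalg.norm(points, axis=-1)
    return (radius >= r_inner) & (radius <= r_outer)


def in_disc(points, r_outer=1.25):
    """Boolean mask: which 2D points lie inside a filled disc (a convex set)."""
    return np.linalg.norm(points, axis=-1) <= r_outer


def fraction_inside(a, b, membership, n_steps=101):
    """Fraction of evenly spaced points on the segment a -> b that stay in the set."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    segment = (1.0 - t) * a + t * b
    return float(membership(segment).mean())


# Two points that both belong to the ring and to the disc.
a = np.array([1.0, 0.0])
b = np.array([-1.0, 0.0])

frac_ring = fraction_inside(a, b, in_ring)  # most of the segment crosses the hole
frac_disc = fraction_inside(a, b, in_disc)  # the whole segment stays inside
```

In this toy case the endpoints belong to both sets, yet only the disc contains every point between them, which is the property a linear latent interpolation implicitly relies on.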
One approach to overcoming nonconvexity in a latent space is to force the latent representation of the dataset into a predefined distribution (e.g. a normal distribution), as is done in VAEs. By constraining the latent space of a VAE to fit a normal distribution, the latent space representations are encouraged to belong to a convex set. This method, however, pre-imposes a distribution in latent space that may be a suboptimal representation of the high-dimensional dataset. Standard GAN latent distributions are sampled directly, similarly allowing arbitrary convex distributions to be explicitly chosen for latent spaces. In both cases, hard-coding the distributional structure of the latent space may not respect the high-dimensional structure of the original distribution.
1.3 Pixelwise error and bidirectionality
AEs that perform dimensionality reduction (in particular VAEs) can produce blurry images due to their pixelwise loss functions (Goodfellow et al., 2014; Larsen et al., 2015), which minimize loss by smoothing the sharp contrasts (e.g. edges) present in real data. GANs do not suffer from this blurring problem, because they are not trained to reproduce input data. Instead, GANs are trained to generate data that could plausibly belong to the true distribution of $X$. Thus, smoothing over uncertainty tends to be discouraged by the discriminator, because it can use smoothed edges as a distinguishing feature between data sampled from the true distribution and data produced by the generator.
Producing data that fits into the distribution of $X$, rather than reproducing individual instances of $x$, comes at a cost, however. While AEs learn both the translation from $X$ to $Z$ and from $Z$ to $X$, GANs only learn the latter ($Z \rightarrow X$). In other words, the pixelwise loss function of the AE produces smoothed data but is bidirectional, while the discriminator-based loss function of the GAN produces sharp images and is unidirectional.
2 Generative Adversarial Interpolative Autoencoding (GAIA)
Our model, GAIA (Figure 1 left), is bidirectional but is trained on both a GAN loss function and a pixelwise loss function, where the pixelwise loss function is passed across the discriminator of the GAN to ensure that features such as blurriness are discriminated against. In full, GAIA is trained as a GAN in which both the generator and the discriminator are AEs. The discriminator is trained to minimize the pixelwise loss between real data ($x_i$) and their AE reproduction in the discriminator ($D(x_i)$) while maximizing the AE loss between samples generated by the generator ($G(x_i)$) and their reproduction in the discriminator ($D(G(x_i))$):
$$\lVert x_i - D(x_i) \rVert - \lVert G(x_i) - D(G(x_i)) \rVert$$
where $\lVert \cdot \rVert$ denotes the pixelwise error. The generator is trained on the inverse, to minimize the pixelwise loss between input ($x_i$) and output ($D(G(x_i))$) such that the discriminator reproduces the generated samples to be as close to the original data as possible:
$$\lVert x_i - D(G(x_i)) \rVert$$
Using an AE as a generator has previously been done in the VAE-GAN (Larsen et al., 2015), and decreases blurring from the pixelwise loss in AEs at the expense of exact reproduction of data. Similarly, using an AE as a discriminator has previously been done in BEGAN (Berthelot et al., 2017), which improves stability in GANs but remains unidirectional (although it is possible to find the regions of $Z$ most closely corresponding to $x$). In GAIA, we combine these two architectures, allowing the generator to be trained on a pixelwise loss that is passed across the discriminator, explicitly reproducing data as in an AE, while producing the sharper samples characteristic of a GAN.
In addition, linear interpolations are taken between the latent-space representations:
$$z_i = G_e(x_i), \quad z_j = G_e(x_j), \quad z_{\mathrm{int}} = \alpha z_i + (1 - \alpha) z_j$$
where $G_e$ is the encoder of the generator and the interpolations are Euclidean interpolations between pairs of points in $Z$, with the position $\alpha$ sampled from a one-dimensional Gaussian distribution centered on the midpoint between $z_i$ and $z_j$. (We allow interpolations to go past the endpoints to permit some degree of generalization beyond interpolation. We sample around the midpoint using a Gaussian rather than uniformly because we found that interpolations close to the original samples required less training to become realistic than interpolations farther from them.) The interpolated points are then passed through the decoder of the generator, $G_d$, and the results are treated as generated samples by the GAN loss function:
$$\lVert G_d(z_{\mathrm{int}}) - D(G_d(z_{\mathrm{int}})) \rVert$$
The discriminator is trained to maximize this loss, and the generator is trained to minimize this loss.
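The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `sample_interpolations`, the random pairing of batch members, and the standard deviation `sigma=0.25` are assumptions (the text only states that the Gaussian is centered on the midpoint, which corresponds to `mu=0.5` here).

```python
import numpy as np


def sample_interpolations(z, rng, mu=0.5, sigma=0.25):
    """Interpolate each latent point with a randomly chosen partner from the batch.

    z     : array of shape (batch, latent_dim) holding encoded samples
    mu    : center of the Gaussian over the interpolation position (0.5 = midpoint)
    sigma : assumed spread; values of alpha outside [0, 1] extrapolate past endpoints
    """
    batch_size = z.shape[0]
    partner = rng.permutation(batch_size)          # hypothetical pairing scheme
    alpha = rng.normal(mu, sigma, size=(batch_size, 1))
    z_int = alpha * z + (1.0 - alpha) * z[partner]
    return z_int, alpha, partner


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 2))                        # stand-in for encoder outputs
z_int, alpha, partner = sample_interpolations(z, rng)

# With sigma = 0 every interpolation is exactly the midpoint of its pair.
z_mid, _, partner_mid = sample_interpolations(z, rng, sigma=0.0)
```

Because the position is Gaussian rather than clipped to the unit interval, a fraction of the sampled points falls slightly beyond either endpoint, matching the extrapolation the text allows.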
In full, the loss of the discriminator, as in BEGAN, is to minimize the pixelwise loss of real data and to maximize the pixelwise loss of generated data (including interpolations):
$$\lVert x_i - D(x_i) \rVert - \lVert G(x_i) - D(G(x_i)) \rVert - \lVert G_d(z_{\mathrm{int}}) - D(G_d(z_{\mathrm{int}})) \rVert$$
where $\lVert \cdot \rVert$ is the pixelwise error and $G_d(z_{\mathrm{int}})$ denotes a latent interpolation $z_{\mathrm{int}}$ decoded by the generator. The loss of the generator is to minimize the error across the discriminator for the input data in the generator ($\lVert x_i - D(G(x_i)) \rVert$), along with minimizing the error of the interpolations generated by the generator ($\lVert G_d(z_{\mathrm{int}}) - D(G_d(z_{\mathrm{int}})) \rVert$).
In summary, the generator and discriminator are both AEs. As a result, reconstructions of $x_i$ have the potential to resemble the input data at a pixel level, a feature absent from other GAN-based inference methods (Figure 5). We also train the network on interpolations in the generator, to explicitly train the generator to produce interpolations ($G_d(z_{\mathrm{int}})$) which deceive the discriminator and are closer to the distribution in $X$ than interpolations from an unconstrained AE.
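To make the two objectives concrete, the following sketch computes both losses for one minibatch. It is written under stated assumptions and is not the authors' code: the pixelwise error is assumed to be a mean absolute error, all terms are weighted equally, and `pixelwise_error`, `gaia_losses`, and the toy `shrink` discriminator are hypothetical names introduced for illustration.

```python
import numpy as np


def pixelwise_error(a, b):
    """Mean absolute pixelwise error (an assumed choice of error function)."""
    return float(np.mean(np.abs(a - b)))


def gaia_losses(x, x_gen, x_int, discriminator):
    """Discriminator and generator losses for one minibatch.

    x             : real data
    x_gen         : generator reconstructions of x
    x_int         : decoded latent interpolations
    discriminator : callable autoencoder mapping data to its reproduction
    """
    real_err = pixelwise_error(x, discriminator(x))
    gen_err = pixelwise_error(x_gen, discriminator(x_gen))
    int_err = pixelwise_error(x_int, discriminator(x_int))

    # Discriminator: reproduce real data well, generated data poorly.
    d_loss = real_err - (gen_err + int_err)

    # Generator: the discriminator's reproduction of the reconstruction should
    # match the original input, and interpolations should be reproduced well.
    g_loss = pixelwise_error(x, discriminator(x_gen)) + int_err
    return d_loss, g_loss


# Toy check with a "discriminator" that shrinks its input by 10 percent.
shrink = lambda v: 0.9 * v
x = np.ones((4, 3))
d_loss, g_loss = gaia_losses(x, 2.0 * x, 3.0 * x, shrink)
```

In a real training loop the two losses would be minimized by separate optimizers over the discriminator and generator parameters respectively, with gradients flowing through the discriminator into the generator for `g_loss`.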
2.1 Preservation of local structure in high-dimensional data
VAEs and GANs force the latent distribution, $Z$, into a predefined distribution, for example a Gaussian or uniform distribution. This approach presents a number of advantages, such as ensuring latent space convexity and thus being better able to sample from the distribution. However, these benefits are gained at the cost of respecting the structure of the distribution of the original high-dimensional dataset, $X$. Preserving high-dimensional structure in low-dimensional embeddings is often the goal of dimensionality reduction, one of the functions of an autoencoder (Hinton & Salakhutdinov, 2006). To better respect the original high-dimensional structure of the dataset, we impose a regularization between the latent space representations of the data ($Z$) and the original high-dimensional dataset ($X$), motivated by Multidimensional Scaling (Kruskal, 1964).
For each minibatch presented to the network, we compute a loss on the difference between the logs of the pairwise Euclidean distances of samples in $X$ and in $Z$.
We then apply this error term to the generator to encourage the pairwise distances of the minibatch in latent space to be similar to the pairwise distances of the minibatch in the original high-dimensional space. Similar methods for preserving within-batch structure have previously been used in both AEs (Yu et al., 2013) and GANs (Benaim & Wolf, 2017).
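A within-batch distance regularizer of this kind can be sketched as follows. The exact functional form used in the paper is not recoverable from the text, so several choices here are assumptions: the squared difference, the averaging over all pairs, the offset `eps=1.0` added before the logarithm to keep zero distances finite, and the function names.

```python
import numpy as np


def pairwise_distances(a):
    """Euclidean distances between all rows of a 2D array, shape (n, n)."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def log_distance_loss(x, z, eps=1.0):
    """Mean squared difference between log pairwise distances in data and latent space."""
    x_flat = x.reshape(len(x), -1)      # flatten images to vectors
    z_flat = z.reshape(len(z), -1)
    log_dx = np.log(pairwise_distances(x_flat) + eps)
    log_dz = np.log(pairwise_distances(z_flat) + eps)
    return float(np.mean((log_dx - log_dz) ** 2))


rng = np.random.default_rng(0)
x_batch = rng.normal(size=(16, 8, 8))   # stand-in minibatch of 8x8 "images"
z_same = x_batch.reshape(16, -1)        # latent that preserves all distances
z_scaled = 5.0 * z_same                 # latent that distorts distances

loss_same = log_distance_loss(x_batch, z_same)
loss_scaled = log_distance_loss(x_batch, z_scaled)
```

Working on log distances penalizes relative rather than absolute distortions, so a pair that is twice as far apart in latent space contributes similarly whether the original distance was small or large.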
3 Experiments
Here we apply our network architecture to two datasets: (1) five 2D distributions from scikit-learn (Pedregosa et al., 2011), which allow us to visualize and quantify the behavior of GAIA in a low-dimensional space (Figure 2), and (2) the CELEBA-HQ dataset (Liu et al., 2015; Karras et al., 2017), which allows us to test the performance of GAIA on a more complex, high-dimensional dataset (Figure 3).
3.1 2D datasets
We compared the performance of AE, VAE, and GAIA networks with the same architecture and training parameters on five 2D distributions from scikit-learn (Pedregosa et al., 2011). We also compared the GAIA network with and without the distance loss term. We computed the learned latent representations ($Z$) as well as the reconstructions of $X$ from each of the networks (Figures 2, 6), and compared these spaces on a number of metrics (Table 1).
Our most salient observation can be found in the meshgrids in Figure 2, where a clear boundary exists in the warping of high- and low-probability data in GAIA, as opposed to an autoencoder without adversarial regularization. The meshgrids are generated by sampling uniformly in $Z$ from points within the convex hull of the latent representations of the data, and projecting those points into $X$. The grid is then plotted at the corresponding points in $X$. A similar warping of low-probability data is seen in the VAE, although a smoother warping is seen at the boundaries.
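The sampling half of the meshgrid procedure (a regular grid in latent space restricted to the convex hull of the encoded data) can be sketched without any dependency beyond NumPy. This is an illustrative reconstruction for 2D latent spaces only: the hull algorithm (Andrew's monotone chain), the grid resolution, and all function names are assumptions, and the final projection into data space is omitted because it requires a trained decoder.

```python
import numpy as np


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1])


def inside_hull(hull, query, tol=1e-12):
    """True where a query point lies inside or on a counter-clockwise convex polygon."""
    edge = np.roll(hull, -1, axis=0) - hull            # (H, 2) edge vectors
    rel = query[:, None, :] - hull[None, :, :]         # (Q, H, 2) offsets from edge starts
    cross = edge[None, :, 0] * rel[:, :, 1] - edge[None, :, 1] * rel[:, :, 0]
    return np.all(cross >= -tol, axis=1)


def latent_meshgrid(z, n_per_axis=20):
    """Regular grid over the bounding box of z, keeping only points inside its hull."""
    lo, hi = z.min(axis=0), z.max(axis=0)
    gx, gy = np.meshgrid(np.linspace(lo[0], hi[0], n_per_axis),
                         np.linspace(lo[1], hi[1], n_per_axis))
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    return grid[inside_hull(convex_hull(z), grid)]


# Toy latent codes: a triangle plus one interior point.
z_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.2, 0.2]])
grid = latent_meshgrid(z_demo, n_per_axis=11)
```

Each retained grid point would then be passed through the decoder and drawn at its decoded location, so that distortions of the grid reveal how the network warps regions of latent space that contain little or no data.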
We quantitatively analyzed the results of Figure 2 in Table 1. We found that interpolations in GAIA are the most likely to fall into the distribution of $X$ (Figure 2 top). We also found that the distributions of both network reconstructions and interpolations most closely match the input distribution in the VAE network (measured by Kullback-Leibler divergence). This is likely due to the adversarial loss in GAIA: while VAEs are trained to match the distribution of $X$, GAIA's generator is trained to find regions of $X$ which are of sufficiently high probability that the discriminator will not discriminate against them. Finally, we found that pairwise Euclidean distances in $Z$ most closely resembled those of the original data distribution in the GAIA network when the distance loss was imposed on the network. This leads us to conclude that GAIA can learn to map interpolations in latent space onto the true data distribution in $X$ in a similar manner to a VAE, while still respecting the original structure of the data.
Table 1
Model | Pearson correlation | Log-likelihood | Log-likelihood | KL divergence | KL divergence
AE | 0.58 | 3207.39 | 3545.63 | 0.43 | 0.57
VAE | 0.64 | 3574.11 | 3588.50 | 0.05 | 0.64
GAIA_{ll=0} | 0.38 | 3567.49 | 3564.27 | 0.19 | 0.36
GAIA_{ll=1} | 0.91 | 3593.84 | 3563.91 | 0.36 | 0.32
Results are intercepts from an OLS regression controlling for 2D dataset type; thus some values (such as KL divergence) can be negative.
3.2 CELEBA-HQ
To observe the performance of our network on more complex, high-dimensional data, we use the CELEBA-HQ image dataset of aligned celebrity faces. We find that interpolations in $Z$ produce smooth, realistic morphs in $X$ (Figure 3), and that complex features can be manipulated as linear vectors in the low-dimensional latent space of the network (Figure 4).
3.2.1 Feature manipulation
Feature manipulation using generative models typically falls into two domains: (1) fully unsupervised approaches, where feature vectors are extracted and applied after learning (e.g. Radford et al. 2015; Larsen et al. 2015; Kingma & Dhariwal 2018; White 2016), and (2) supervised and partially supervised approaches, where high-level feature information is used during learning (e.g. Choi et al. 2017; Isola et al. 2017; Li et al. 2016; Perarnau et al. 2016; Zhu et al. 2017).
We find that, similar to the former group of models, high-level features correspond to linear vectors in GAIA's latent spaces (Figure 4). High-level feature representations are typically determined using the means of representations of images containing features (e.g. Radford et al. 2015; Larsen et al. 2015). The mean latent representation of the faces in the dataset (here CELEBA-HQ) that do not contain an attribute is subtracted from the mean latent representation of the faces that do contain it, yielding a high-level feature vector. The feature vector is then added to, or subtracted from, the latent representation of an individual face, which is passed through the decoder of the generator, producing an image containing that high-level feature.
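The mean-difference procedure described above amounts to a few array operations. The sketch below uses made-up two-dimensional latent codes and a hypothetical helper name `attribute_vector`; decoding the edited latent code back into an image would require the trained generator and is not shown.

```python
import numpy as np


def attribute_vector(z, has_attribute):
    """Mean latent code of samples with an attribute minus the mean of those without."""
    has_attribute = np.asarray(has_attribute, dtype=bool)
    return z[has_attribute].mean(axis=0) - z[~has_attribute].mean(axis=0)


# Toy latent codes: the first two samples carry the attribute, the last two do not.
z = np.array([[1.0, 0.0],
              [3.0, 0.0],
              [0.0, 2.0],
              [0.0, 4.0]])
labels = np.array([1, 1, 0, 0])

v = attribute_vector(z, labels)     # [2, 0] - [0, 3] = [2, -3]
z_edited = z[2] + 1.0 * v           # push a sample without the attribute toward it
```

Scaling `v` by a factor other than 1.0 controls how strongly the attribute is added, and subtracting it moves a sample in the opposite direction.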
Similar to White (2016), we find that this approach is confounded by features being entangled in the CELEBA-HQ dataset. For example, adding a latent vector to make images look older biases the image toward male, and making the image look younger biases the image toward female. This likely happens because the ratio of young males to older males is 0.28:1, whereas the ratio of young females to older females is much greater, at 8.49:1, in the CELEBA-HQ dataset. Unlike White (2016), who balances samples containing features in the training dataset, we use as feature vectors the coefficients of an ordinary least-squares regression trained on the full dataset to predict $Z$ representations from feature attributes. We find that these features (Figure 7 bottom) are less intertwined than those obtained by subtracting means alone (Figure 7 top).
3.3 Related work
This work builds primarily upon the GAN and AE. We used the AE as a discriminator, motivated by Berthelot et al. (2017), and an AE as a generator, motivated by Larsen et al. (2015), which in concert act as both an autoencoder and a GAN, imparting bidirectionality on a GAN and imparting an adversarial loss on the autoencoder. A number of other adversarial network architectures (e.g. Larsen et al. 2015; Mescheder et al. 2017; Berthelot et al. 2017; Dumoulin et al. 2016; Ulyanov et al. 2017; Makhzani 2018) have been designed with a similar motivation in recent years. Our approach differs from these methods in that, by using an autoencoder as the discriminator, we are able to use a reconstruction loss which is passed across the discriminator, resulting in pixelwise data reconstructions (Figure 5).
Similar motivations for better bidirectional inference-based methods have also been explored using flow-based generative models (Kingma & Dhariwal, 2018; Dinh et al., 2014; Kingma et al., 2016; Dinh et al., 2016), which do not rely on an adversarial loss. Due to their exact latent-variable inference (Kingma & Dhariwal, 2018), these architectures may also provide a useful direction for developing generative models to explore latent spaces of data for generating datasets for psychophysical experiments.
In addition, the first revision of this work was published concurrently with the ACAI network (Berthelot et al., 2018), which also uses an adversarial constraint on interpolations in the latent space of an autoencoder. Berthelot et al. find that adversarially constrained latent representations improve downstream tasks such as classification and clustering. At a high level, GAIA and ACAI networks perform the same functions; however, there are a few notable differences between the two networks. While GAIA uses an autoencoder as the discriminator of the adversarial network in order to pass the autoencoder error function across the discriminator, ACAI uses a traditional discriminator. As a result, the loss function is different between the two networks. Further comparisons are needed between the two architectures to compare network features such as training stability, reconstruction quality, latent feature representations, and downstream task performance.
4 Conclusion
We propose a novel GAN-AE hybrid in which both the generator and discriminator of the GAN are AEs. In this architecture, a pixelwise loss can be passed across the discriminator, producing autoencoded data without smoothed reconstructions. Further, using the adversarial loss of the GAN, we train the generator's AE explicitly on interpolations between samples projected into latent space, promoting a convex latent space representation. We find that in our 2D dataset examples, GAIA performs equivalently to a VAE in projecting interpolations in $Z$ onto the true data distribution in $X$, while respecting the original structure in $X$. We conclude that our method more explicitly lends itself to interpolations between complex signals using a neural network latent space, while still respecting the high-dimensional structure of the input data.
The proposed architecture still leaves much to be accomplished, and modifications of this architecture may prove to be more useful, for example utilizing different encoder strategies such as progressively growing layers (Karras et al., 2017), interpolating across the entire minibatch rather than two-point interpolations, modeling the joint probability of X and Z, and exploring other methods to train more explicitly on a convex latent space. Further explorations are also needed to understand how interpolative sampling affects the structure of the latent space of GAIA in higher dimensions.
Our network architecture furthers generative modeling by providing a novel solution to maintaining pixelwise reconstruction over an adversarial architecture. Further, we take a step in the direction of convex latent space representations in a generative context. This architecture should prove useful both for current behavioral scientists interested in sampling from smooth and plausible stimuli spaces (e.g. Sainburg et al. 2018; Thielk et al. 2018; Zuidema et al. 2018), as well as providing motivation for future solutions to structured latent representations of data.
Our network was trained using TensorFlow, and our full model and code are available on GitHub. A video of a trained network interpolating between real images is available on YouTube.
References
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 Benaim & Wolf (2017) Sagie Benaim and Lior Wolf. One-sided unsupervised domain mapping. In Advances in neural information processing systems, pp. 752–762, 2017.
 Berthelot et al. (2017) David Berthelot, Thomas Schumm, and Luke Metz. Began: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
 Berthelot et al. (2018) David Berthelot, Colin Raffel, Aurko Roy, and Ian Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543, 2018.
 Bishop (2006) Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 Bouchacourt et al. (2017) Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multilevel variational autoencoder: Learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841, 2017.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.

 Choi et al. (2017) Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint, 1711, 2017.
 Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

 Dosovitskiy et al. (2015) Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538–1546, 2015.
 Dumoulin et al. (2016) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 Engel et al. (2017) Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. Neural audio synthesis of musical notes with wavenet autoencoders. arXiv preprint arXiv:1704.01279, 2017.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
 Hinton & Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Huang et al. (2018) Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. arXiv preprint arXiv:1804.04732, 2018.

 Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
 Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 Kingma & Dhariwal (2018) Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
 Klee (1971) Victor Klee. What is a convex set? The American Mathematical Monthly, 78(6):616–631, 1971.
 Kruskal (1964) Joseph B Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
 Larsen et al. (2015) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
 Li et al. (2016) Mu Li, Wangmeng Zuo, and David Zhang. Deep identityaware transfer of facial attributes. arXiv preprint arXiv:1610.05586, 2016.
 Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 Makhzani (2018) Alireza Makhzani. Implicit autoencoders. arXiv preprint arXiv:1805.09804, 2018.
 Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Mescheder et al. (2017) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
 Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
 Perarnau et al. (2016) Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, and Jose M Álvarez. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016.
 Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Sainburg et al. (2018) Tim Sainburg, Marvin Thielk, and Timothy Gentner. Sampling generative networks. Computational Cognitive Neuroscience, 2018.
 Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
 Thielk et al. (2018) Marvin Thielk, Tim Sainburg, Tatyana Sharpee, and Timothy Gentner. Combining biological and artificial approaches to understand perceptual spaces for categorizing natural acoustic signals. Computational Cognitive Neuroscience, 2018.
 Ulyanov et al. (2017) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Adversarial generator-encoder networks. CoRR, abs/1704.02304, 2017.
 Van Den Oord et al. (2016) Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In SSW, pp. 125, 2016.
 Webster (2018) M Webster. generative, 2018. URL https://www.merriamwebster.com/dictionary/generative.
 White (2016) Tom White. Sampling generative networks. arXiv preprint arXiv:1609.04468, 2016.
 Yu et al. (2013) Wenchao Yu, Guangxiang Zeng, Ping Luo, Fuzhen Zhuang, Qing He, and Zhongzhi Shi. Embedding with autoencoder regularization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 208–223. Springer, 2013.
 Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.
 Zuidema et al. (2018) Willem Zuidema, Robert M. French, Raquel G. Alhama, Kevin Ellis, Tim O’Donnell, Tim Sainburg, and Timothy Q. Gentner. Five ways in which computational models can help advancing artificial grammar learning research. in press, 2018.
5 Appendix
5.1 Network Architecture
In principle, any form of AE network can be used in GAIA. In the experiments shown in this paper, we used two different types of networks. For the 2D dataset examples, we use 6 fully connected layers per network with 256 units per layer, and a latent layer with two neurons. For the CELEBA-HQ dataset, we use a modified version of the network architecture advocated by Huang et al. (2018), which is composed of a style and a content AE using residual convolutional layers. Each layer of the decoder uses half as many filters as the encoder, and a linear latent layer is used in the encoder network but not the decoder network. The final number of latent neurons for the style and content networks is 512 each in the pixel model shown here. The weight of the pairwise-distance loss term is set at 2e-5. A Python/TensorFlow implementation of this network architecture is linked in the Conclusion section, and more details about the network architecture used are located in Huang et al. (2018).
5.2 Instability in adversarial networks
GANs are notoriously challenging to train, and refining techniques to balance and properly train GANs has been an area of active research since the conception of the GAN architecture (e.g. Berthelot et al. 2017; Salimans et al. 2016; Mescheder et al. 2017; Arjovsky et al. 2017). In traditional GANs, a balance needs to be found between training the generator and discriminator; otherwise one network will overpower the other and the generator will not learn a representation which fits the dataset. With GAIA, additional balances are required, such as between reproducing real images vs. discriminating against generated images, or balancing the generator of the network toward emphasizing autoencoding vs. producing high-quality latent-space interpolations.
We propose a novel, but simple, GAN balancing act which we find to be very effective. In our network, we balance the GAN's loss using a sigmoid centered at zero:
$$\mathrm{sigmoid}(d) = \frac{1}{1 + e^{-b d}}$$
in which $b$ is a hyperparameter representing the slope of the sigmoid (kept at 20 for our networks), and $d$ is the difference between the two values being balanced in the network. For example, the balance between the learning rates of the discriminator and the generator is based upon the losses on the real and generated images: the learning rate of the discriminator is weighted by the sigmoid of the difference between these losses, and the learning rate of the generator is then set as the inverse.
This allows each network to catch up to the other network when it is performing worse. The same principles are then used to balance the different losses within the generator and discriminator, which can be found in Algorithm 1. This balancing act allows the part of the network performing more poorly to be emphasized in the training regimen, resulting in more balanced and stable training.
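The balancing scheme can be illustrated with a short, dependency-free sketch. Only the sigmoid form and the slope of 20 come from the text; the sign convention (real-image loss minus generated-image loss), the reading of "inverse" as one minus the discriminator weight, the base learning rate, and the function names are assumptions made for illustration.

```python
import math


def balance(diff, slope=20.0):
    """Sigmoid centered at zero, written to avoid overflow for large |diff|."""
    x = slope * diff
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def balanced_learning_rates(loss_real, loss_generated, base_lr=1e-4, slope=20.0):
    """Split a base learning rate between discriminator and generator.

    Assumed convention: when the discriminator reproduces real images worse than
    generated ones (loss_real > loss_generated), it is falling behind and
    receives the larger share of the learning rate.
    """
    d_weight = balance(loss_real - loss_generated, slope)
    g_weight = 1.0 - d_weight           # equals balance(-(difference), slope)
    return base_lr * d_weight, base_lr * g_weight


lr_d_even, lr_g_even = balanced_learning_rates(0.10, 0.10)
lr_d_behind, lr_g_behind = balanced_learning_rates(0.30, 0.10)
```

With a slope of 20, a loss difference of only 0.2 already shifts almost all of the learning rate to the weaker network, so the hyperparameter effectively sets how sharply training switches emphasis.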