## 1 Introduction

Generative models are a popular approach to unsupervised machine learning. Generative neural network models are trained to produce data samples that resemble the training set. Because the number of model parameters is significantly smaller than the training data, the models are forced to discover efficient data representations. These models are sampled from a set of latent variables in a high dimensional space, here called a latent space. Latent space can be sampled to generate observable data values. Learned latent representations often also allow semantic operations with vector space arithmetic (Figure 1).

Generative models are often applied to datasets of images. Two popular generative models for image data are the Variational Autoencoder (VAE, Kingma & Welling, 2014) and the Generative Adversarial Network (GAN, Goodfellow et al., 2014). VAEs use the framework of probabilistic graphical models with an objective of maximizing a lower bound on the likelihood of the data. GANs instead formalize the training process as a competition between a generative network and a separate discriminative network. Though these two frameworks are very different, both construct high dimensional latent spaces that can be sampled to generate images resembling training set data. Moreover, these latent spaces are generally highly structured and can enable complex operations on the generated images by simple vector space arithmetic in the latent space (Larsen et al., 2016).

Generative models are beginning to find their way out of academia and into creative applications. In this paper we present techniques for improving the visual quality of generative models that are generally independent of the model itself. These include spherical linear interpolation, visualizing analogies with J-diagrams, and generating local manifolds with MINE grids. These techniques can be combined to generate low dimensional embeddings of images close to the trained manifold, which can be used for visualization and for creating realistic interpolations across latent space. Also, by standardizing these operations independent of model type, the latent spaces of different generative models become more directly comparable with each other, exposing the strengths and weaknesses of various approaches.

Additionally, two new techniques for building latent space attribute vectors are introduced. On labeled datasets with correlated labels, data replication can be used to create bias-corrected vectors. Synthetic attribute vectors can also be derived via data augmentation on unlabeled data. Quantitative analysis of attribute vectors can be performed by using them as the basis for attribute binary classifiers.

## 2 Sampling Techniques

Generative models are often evaluated by examining samples from the latent space. Techniques frequently used are random sampling and linear interpolation. But often these can result in sampling the latent space from locations very far outside the manifold of probable locations.

Our work has followed two useful principles when sampling the latent space of a generative model. The first is to avoid sampling from locations that are highly unlikely given the prior of the model. This technique is well established; it was used in the original VAE paper, which adjusted sampling through the inverse CDF of the Gaussian to accommodate the Gaussian prior (Kingma & Welling, 2014). The second principle is to recognize that the dimensionality of the latent space is often artificially high and may contain dead zones that are not on the manifold learned during training. This has been demonstrated for VAE models (Makhzani et al., 2016) and implies that simply matching the model’s prior will not always be sufficient to yield samples that appear to have been drawn from the training set.
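As an illustration of the first principle, the sketch below builds a 2-D sampling grid whose coordinates are evenly spaced in probability and then mapped through the inverse CDF of a standard Gaussian. The function name `gaussian_grid`, the grid size, and the `eps` margin are illustrative assumptions, not values taken from the VAE paper.

```python
from statistics import NormalDist

import numpy as np


def gaussian_grid(n=8, eps=0.05):
    """Return an (n, n, 2) grid of latent coordinates matched to a N(0, 1) prior."""
    # Evenly spaced probabilities, kept away from 0 and 1 where the
    # inverse CDF diverges.
    probs = np.linspace(eps, 1.0 - eps, n)
    # Map each probability through the inverse Gaussian CDF.
    coords = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    zx, zy = np.meshgrid(coords, coords)
    return np.stack([zx, zy], axis=-1)


grid = gaussian_grid()
print(grid.shape)
```

Grid points are denser near the origin, where the prior places most of its mass, and sparser in the tails.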

### 2.1 Interpolation

Interpolation is used to traverse between two known locations in latent space. Research on generative models often uses interpolation as a way of demonstrating that a generative model has not simply memorized the training examples (e.g., Radford et al., 2015, §6.1). In creative applications, interpolations can be used to provide smooth transitions between two decoded images.

Frequently linear interpolation is used, which is easily understood and implemented. But this is often inappropriate as the latent spaces of most generative models are high dimensional (> 50 dimensions) with a Gaussian or uniform prior. In such a space, linear interpolation traverses locations that are extremely unlikely given the prior. As a concrete example, consider a 100 dimensional space with the Gaussian prior $\mu = 0$, $\sigma = 1$. Here all random vectors will generally have a length very close to 10 (standard deviation < 1). However, linearly interpolating between any two will usually result in a "tent-pole" effect as the magnitude of the vector decreases from roughly 10 to 7 at the midpoint, which is over 4 standard deviations away from the expected length.
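The tent-pole effect can be checked numerically. The following is a small sketch, assuming a standard Gaussian prior in 100 dimensions as in the example above; the sample count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100

# Draw many pairs of latent vectors from the standard Gaussian prior.
a = rng.standard_normal((10000, dim))
b = rng.standard_normal((10000, dim))

# Lengths of the endpoints and of the linear-interpolation midpoints.
endpoint_norms = np.linalg.norm(a, axis=1)
midpoint_norms = np.linalg.norm(0.5 * (a + b), axis=1)

print("endpoint length: mean %.2f, std %.2f"
      % (endpoint_norms.mean(), endpoint_norms.std()))
print("midpoint length: mean %.2f" % midpoint_norms.mean())
```

The midpoint of two independent samples has per-dimension variance 1/2, so its expected length is close to $\sqrt{50} \approx 7$ rather than 10.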

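Our proposed solution is to use spherical linear interpolation (slerp) instead of linear interpolation. We use a formula introduced by Shoemake (1985) in the context of great arc in-betweening for rotation animations:

$$\mathrm{Slerp}(q_1, q_2; \mu) = \frac{\sin\big((1-\mu)\theta\big)}{\sin\theta}\, q_1 + \frac{\sin(\mu\theta)}{\sin\theta}\, q_2$$

where $\theta$ is the angle between $q_1$ and $q_2$.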

This treats the interpolation as a great circle path on an n-dimensional hypersphere (with elevation changes). This technique has shown promising results on both VAE and GAN generative models and with both uniform and Gaussian priors (Figure 2).
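A minimal NumPy sketch of slerp applied to latent vectors follows. The angle is computed from the normalized directions while the weights are applied to the unnormalized vectors, which is one way to obtain the "elevation changes" mentioned above; the fallback to linear interpolation for nearly collinear vectors and the `eps` tolerance are assumptions of this sketch.

```python
import numpy as np


def slerp(mu, p0, p1, eps=1e-7):
    """Spherical linear interpolation between p0 and p1 for mu in [0, 1]."""
    # Angle between the two vectors, from their normalized directions.
    cos_theta = np.dot(p0 / np.linalg.norm(p0), p1 / np.linalg.norm(p1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    sin_theta = np.sin(theta)
    if sin_theta < eps:
        # Nearly collinear vectors: the slerp weights are ill-conditioned,
        # so fall back to plain linear interpolation.
        return (1.0 - mu) * p0 + mu * p1
    w0 = np.sin((1.0 - mu) * theta) / sin_theta
    w1 = np.sin(mu * theta) / sin_theta
    return w0 * p0 + w1 * p1


rng = np.random.default_rng(0)
p0, p1 = rng.standard_normal(100), rng.standard_normal(100)
slerp_mid = slerp(0.5, p0, p1)
lerp_mid = 0.5 * (p0 + p1)
print(np.linalg.norm(slerp_mid), np.linalg.norm(lerp_mid))
```

For two samples from a 100-dimensional Gaussian prior, the slerp midpoint keeps a length comparable to the endpoints, while the linear midpoint is noticeably shorter.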

### 2.2 Analogy

Analogy has been shown to capture regularities in continuous space models. In the latent space of some linguistic models “King – Man + Woman” results in a vector very close to “Queen” (Mikolov et al., 2013). This technique has also been used in the context of deep generative models to solve visual analogies (Reed et al., 2015).

Analogies are usually written in the form:

$$A : B :: C : ?$$

Such a formulation answers the question “What is the result of applying the transformation A:B to C?” In a vector space the solution generally proposed is to solve the analogy using vector math:

$$? = C + B - A$$

Note that an interesting property of this solution is an implied symmetry:

$$A : C :: B : ?$$

Because the same terms can be rearranged:

$$? = B + C - A$$

Generative models that include an encoder for computing latent vectors given new samples allow for visual analogies. We have devised a visual representation for depicting analogies of visual generative networks called a “J-Diagram”. The J-Diagram uses interpolation across two dimensions to expose the manifold of the analogy. It also makes the symmetric nature of the analogy clear (Figure 3).

The J-Diagram also serves as a reference visualization across different model settings because it is deterministically generated from images which can be held constant. This makes it a useful tool for comparing results across epochs during training, after adjusting hyperparameters, or even across completely different model types (Figure 4).
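The grid of latent vectors behind such a diagram can be sketched as follows. This is an illustration under stated assumptions rather than the exact layout used for the figures: the encoded inputs A, B, and C are placed at the top-left, top-right, and bottom-left corners, the analogy result B + C − A at the bottom-right, and every other cell is filled by slerp. The function names and grid size are hypothetical.

```python
import numpy as np


def slerp(mu, p0, p1):
    """Spherical linear interpolation (falls back to lerp if nearly collinear)."""
    cos_theta = np.dot(p0 / np.linalg.norm(p0), p1 / np.linalg.norm(p1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.sin(theta) < 1e-7:
        return (1.0 - mu) * p0 + mu * p1
    return (np.sin((1.0 - mu) * theta) * p0
            + np.sin(mu * theta) * p1) / np.sin(theta)


def j_diagram_latents(z_a, z_b, z_c, rows=5, cols=5):
    """Return a (rows, cols, dim) grid of latent vectors for an analogy grid."""
    z_d = z_b + z_c - z_a  # analogy result: apply A:B to C
    grid = np.empty((rows, cols, z_a.shape[0]))
    for r, v in enumerate(np.linspace(0.0, 1.0, rows)):
        left = slerp(v, z_a, z_c)    # left edge runs from A down to C
        right = slerp(v, z_b, z_d)   # right edge runs from B down to the result
        for c, u in enumerate(np.linspace(0.0, 1.0, cols)):
            grid[r, c] = slerp(u, left, right)
    return grid


rng = np.random.default_rng(0)
z_a, z_b, z_c = rng.standard_normal((3, 100))
grid = j_diagram_latents(z_a, z_b, z_c)
print(grid.shape)
```

Each grid entry would then be passed through the model's decoder to produce the image shown in that cell. Because the grid depends only on the three encoded inputs, the same inputs yield directly comparable diagrams across checkpoints or models.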

### 2.3 Manifold Traversal

Generative models can produce a latent space that is not tightly packed, and the dimensionality of the latent space is often set artificially high. As a result, the manifold of trained examples can be a subset of the latent space after training, resulting in dead zones in the expected prior.

If the model includes an encoder, one simple way to stay on the manifold is to use only out-of-sample encodings (i.e., any data not used in training) in the latent space. This is a useful diagnostic, but is overly restrictive in a creative application context since it prevents the model from suggesting novel samples. However, we can recover this ability by also including the results of operations on these encoded samples that stay close to the manifold, such as interpolation, extrapolation, and analogy generation.

Ideally, there would be a mechanism to discover this manifold within the latent space. In generative models with an encoder and ample out-of-sample data, we can instead precompute locations on the manifold with sufficient density, and later query for nearby points in the latent space from this known set. This offers a navigation mechanism based on hopping to nearest neighbors across a large database of encoded samples. When combined with interpolation, we call this visualization a Manifold Interpolated Neighbor Embedding (MINE). A MINE grid is useful to visualize local patches of the latent space (Figure 5).
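The neighbor query at the core of this navigation can be sketched as a brute-force search over precomputed encodings. Euclidean distance, the function name `nearest_neighbors`, and the random stand-in database are assumptions of this sketch; the text does not specify a distance metric or index structure.

```python
import numpy as np


def nearest_neighbors(query, encoded, k=4):
    """Return indices and distances of the k encoded samples closest to query."""
    dists = np.linalg.norm(encoded - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]


rng = np.random.default_rng(0)
dim = 100
# Stand-in for a database of encoded out-of-sample images.
encoded = rng.standard_normal((1000, dim))
# A query location slightly perturbed from a known encoding.
query = encoded[42] + 0.01 * rng.standard_normal(dim)

idx, dist = nearest_neighbors(query, encoded)
print(idx, dist)
```

A local grid could then be assembled by placing the returned neighbors at chosen cells and filling the cells between them by interpolation.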

## 3 Attribute Vectors

Many generative models result in a latent space that is highly structured, even on purely unsupervised datasets (Radford et al., 2015). When combined with labeled data, attribute vectors can be computed using simple arithmetic. For example, a vector can be computed which represents the smile attribute, which by shorthand we call a smile vector. Following Larsen et al. (2016), the smile vector can be computed by simply subtracting the mean vector for images without the smile attribute from the mean vector for images with the smile attribute. This smile vector can then be applied in a positive or negative direction to manipulate this visual attribute on samples taken from latent space (Figure 6).
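The mean-difference computation can be sketched in a few lines. The synthetic latents below stand in for encoded images, with the attribute planted along one known axis so the result can be checked; the function name and data are illustrative assumptions.

```python
import numpy as np


def attribute_vector(latents, labels):
    """Mean latent of samples with the attribute minus mean of those without."""
    labels = np.asarray(labels, dtype=bool)
    return latents[labels].mean(axis=0) - latents[~labels].mean(axis=0)


rng = np.random.default_rng(0)
dim = 16
labels = rng.random(2000) < 0.5
latents = rng.standard_normal((2000, dim))

# Plant a synthetic attribute: labeled samples are shifted along axis 3.
direction = np.zeros(dim)
direction[3] = 2.0
latents[labels] += direction

smile_vector = attribute_vector(latents, labels)

# Apply the vector in the positive or negative direction before decoding.
z = latents[0]
z_more, z_less = z + smile_vector, z - smile_vector
print(np.round(smile_vector, 2))
```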

### 3.1 Correlated Labels

The approach of building attribute vectors from means of labeled data has been noted to suffer from correlated labels (Larsen et al., 2016). While many correlations would be expected from ground truths (e.g., heavy makeup and wearing lipstick), we discovered others that appear to be from sampling bias. For example, male and smiling attributes have unexpected negative correlations because women in the CelebA dataset are much more likely to be smiling than men (Table 1).

| | Male | Not Male | Total |
|---|---|---|---|
| Smiling | 17% | 31% | 48% |
| Not Smiling | 25% | 27% | 52% |
| Total | 42% | 58% | |

In an online service we set up to automatically add and remove smiles from images (https://twitter.com/smilevector), we discovered this gender bias was visually evident in the results. Our solution was to use replication on the training data such that the dataset was balanced across attributes. This was effective because ultimately the vectors are simply summed together when computing the attribute vector (Figure 7).
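One way to realize the replication is sketched below: the samples in each of the four (attribute, correlated attribute) cells are repeated until every cell matches the largest one, and the mean difference is taken over the replicated set. The synthetic latents, the cycling replication via `np.resize`, and the function names are assumptions of this sketch; the cell proportions follow Table 1, and all four cells are assumed non-empty.

```python
import numpy as np


def naive_attribute_vector(latents, target):
    target = np.asarray(target, dtype=bool)
    return latents[target].mean(axis=0) - latents[~target].mean(axis=0)


def balanced_attribute_vector(latents, target, confound):
    """Attribute vector for `target` after balancing against `confound`."""
    target = np.asarray(target, dtype=bool)
    confound = np.asarray(confound, dtype=bool)
    cells = {(t, c): np.flatnonzero((target == t) & (confound == c))
             for t in (False, True) for c in (False, True)}
    size = max(len(idx) for idx in cells.values())
    # Replicate each cell (cycling through its samples) up to the largest size.
    rep = {key: np.resize(idx, size) for key, idx in cells.items()}
    pos = np.concatenate([rep[(True, False)], rep[(True, True)]])
    neg = np.concatenate([rep[(False, False)], rep[(False, True)]])
    return latents[pos].mean(axis=0) - latents[neg].mean(axis=0)


rng = np.random.default_rng(0)
n, dim = 10000, 8
# Cells drawn with the Table 1 proportions: smiling/male, smiling/not male,
# not smiling/male, not smiling/not male.
cell = rng.choice(4, size=n, p=[0.17, 0.31, 0.25, 0.27])
smiling = cell < 2
male = (cell == 0) | (cell == 2)

latents = rng.standard_normal((n, dim))
latents[smiling, 0] += 1.0  # synthetic "smile" axis
latents[male, 1] += 1.0     # synthetic "male" axis

naive = naive_attribute_vector(latents, smiling)
balanced = balanced_attribute_vector(latents, smiling, male)
print("naive    male component: %.3f" % naive[1])
print("balanced male component: %.3f" % balanced[1])
```

In this toy setup the naive smile vector picks up a negative component along the male axis, while the balanced version leaves that axis close to zero.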

This balancing technique can also be applied to attributes correlated due to ground truths. Decoupling attributes allows individual effects to be applied separately. As an example, the two attributes smiling and mouth open are highly correlated in the CelebA training set (Table 2). This is not surprising, as physically most people photographed smiling would also have their mouth open. However by forcing these attributes to be balanced, we can construct two decoupled attribute vectors. This allows for more flexibility in applying each attribute separately to varying degrees (Figure 8).

| | Open Mouth | Not Open Mouth | Total |
|---|---|---|---|
| Smiling | 36% | 12% | 48% |
| Not Smiling | 12% | 40% | 52% |
| Total | 48% | 52% | |

### 3.2 Synthetic Attributes

It has been noted that samples drawn from VAE based models tend to be blurry (Goodfellow et al., 2014; Larsen et al., 2016). A possible solution to this would be to discover an attribute vector for “unblur”, and then apply this as a constant offset to latent vectors before decoding. CelebA includes a blur label for each image, and so a blur attribute vector was computed and then extrapolated in the negative direction. This was found to noticeably reduce blur, but also resulted in a number of unwanted artifacts such as increased image brightness.

We concluded this to be the result of human bias in labeling. Labelers appear more likely to label darker images as blurry, so this unblur vector was found to suffer from attribute correlation that also “lightened” the reconstruction. This bias could not be easily corrected because CelebA does not include a brightness label for rebalancing the data.

For the blurring attribute, an algorithmic solution is available. We take a large set of images from the training set and process them through a Gaussian blur filter (Figure 9). Then we run both the original image set and the blurred image set through the encoder and subtract the means of each set to compute a new attribute vector for blur. We call this a synthetic attribute vector because this label is derived from algorithmic data augmentation of the training set. This technique removes the labeler bias, is straightforward to implement, and results in samples closely resembling the reconstructions with less noticeable blur (Figure 10).
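The procedure can be sketched as follows. The Gaussian blur is written out with NumPy for self-containment, and `encode` is a hypothetical stand-in (a random linear projection) for a trained model's encoder; image sizes, `sigma`, and all names are assumptions of this sketch.

```python
import numpy as np


def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(images, sigma=2.0):
    """Blur a batch of grayscale images (N, H, W) with a separable Gaussian."""
    radius = int(3 * sigma)
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(images, ((0, 0), (radius, radius), (radius, radius)),
                    mode="edge")
    # Filter along the width axis, then along the height axis.
    rows = np.apply_along_axis(
        lambda m: np.convolve(m, k, mode="valid"), 2, padded)
    return np.apply_along_axis(
        lambda m: np.convolve(m, k, mode="valid"), 1, rows)


def synthetic_attribute_vector(images, augmented, encode):
    """Mean encoding of augmented images minus mean encoding of originals."""
    return encode(augmented).mean(axis=0) - encode(images).mean(axis=0)


rng = np.random.default_rng(0)
images = rng.random((64, 16, 16))

# Hypothetical encoder: a fixed random linear map to an 8-D latent space.
projection = rng.standard_normal((16 * 16, 8))
encode = lambda batch: batch.reshape(len(batch), -1) @ projection

blurred = gaussian_blur(images, sigma=1.0)
blur_vector = synthetic_attribute_vector(images, blurred, encode)

# Subtracting the blur vector moves a latent code in the "unblur" direction.
z_unblurred = encode(images[:1])[0] - blur_vector
print(blur_vector.shape)
```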

### 3.3 Quantitative Analysis with Classification

The results of applying attribute vectors visually to images are often quite striking. But it would be useful to be able to evaluate the relative efficacy of attribute vectors against each other, or across different models and hyperparameters. Attribute vectors have the potential to be widely applicable outside of images in domain specific latent spaces, and this provides additional challenges for quantifying their performance. For example, recent work has shown the ability to use a latent space model as a continuous representation of molecules (Gómez-Bombarelli et al., 2016); though it is straightforward to generate vectors for attributes such as solubility in these spaces, evaluating the application of these operations to existing molecules would appear to require specific domain knowledge.

| Approach / Attribute | [1] | [2] W | [2] L | [3] ANet | [4] LNet | [4] Full | VAE AtDot | GAN AtDot |
|---|---|---|---|---|---|---|---|---|
| 5 O.C. Shadow | 85 | 82 | 88 | 86 | 88 | 91 | 88 | 89 |
| Arched Eyebrow | 76 | 73 | 78 | 75 | 74 | 79 | 75 | 75 |
| Attractive | 78 | 77 | 81 | 79 | 77 | 81 | 71 | 69 |
| Bags Under Eye | 76 | 71 | 79 | 77 | 73 | 79 | 80 | 79 |
| Bald | 89 | 92 | 96 | 92 | 95 | 98 | 98 | 98 |
| Bangs | 88 | 89 | 92 | 94 | 92 | 95 | 90 | 90 |
| Big Lips | 64 | 61 | 67 | 63 | 66 | 68 | 85 | 76 |
| Big Nose | 74 | 70 | 75 | 74 | 75 | 78 | 77 | 77 |
| Black Hair | 70 | 74 | 85 | 77 | 84 | 88 | 83 | 81 |
| Blond Hair | 80 | 81 | 93 | 86 | 91 | 95 | 87 | 90 |
| Blurry | 81 | 77 | 86 | 83 | 80 | 84 | 95 | 95 |
| Brown Hair | 60 | 69 | 77 | 74 | 78 | 80 | 76 | 81 |
| Bushy Eyebrow | 80 | 76 | 86 | 80 | 85 | 90 | 87 | 86 |
| Chubby | 86 | 82 | 86 | 86 | 86 | 91 | 94 | 94 |
| Double Chin | 88 | 85 | 88 | 90 | 88 | 92 | 95 | 95 |
| Eyeglasses | 98 | 94 | 98 | 96 | 96 | 99 | 95 | 94 |
| Goatee | 93 | 86 | 93 | 92 | 92 | 95 | 93 | 94 |
| Gray Hair | 90 | 88 | 94 | 93 | 93 | 97 | 95 | 96 |
| Heavy Makeup | 85 | 84 | 90 | 87 | 85 | 90 | 76 | 75 |
| High Cheekbone | 84 | 80 | 86 | 85 | 84 | 87 | 81 | 68 |
| Male | 91 | 93 | 97 | 95 | 94 | 98 | 81 | 80 |
| Mouth Open | 87 | 82 | 93 | 85 | 86 | 92 | 78 | 67 |
| Mustache | 91 | 83 | 93 | 87 | 91 | 95 | 95 | 96 |
| Narrow Eyes | 82 | 79 | 84 | 83 | 77 | 81 | 93 | 89 |
| No Beard | 90 | 87 | 93 | 91 | 92 | 95 | 85 | 84 |
| Oval Face | 64 | 62 | 65 | 65 | 63 | 66 | 72 | 71 |
| Pale Skin | 83 | 84 | 91 | 89 | 87 | 91 | 96 | 96 |
| Pointy Nose | 68 | 65 | 71 | 67 | 70 | 72 | 72 | 72 |
| Recede Hair | 76 | 82 | 85 | 84 | 85 | 89 | 93 | 92 |
| Rosy Cheeks | 84 | 81 | 87 | 85 | 87 | 90 | 93 | 93 |
| Sideburns | 94 | 90 | 93 | 94 | 91 | 96 | 94 | 94 |
| Smiling | 89 | 89 | 92 | 92 | 88 | 92 | 87 | 68 |
| Straight Hair | 63 | 67 | 69 | 70 | 69 | 73 | 80 | 79 |
| Wavy Hair | 73 | 76 | 77 | 79 | 75 | 80 | 75 | 75 |
| Earring | 73 | 72 | 78 | 77 | 78 | 82 | 81 | 81 |
| Hat | 89 | 91 | 96 | 93 | 96 | 99 | 96 | 97 |
| Lipstick | 89 | 88 | 93 | 91 | 90 | 93 | 79 | 77 |
| Necklace | 68 | 67 | 67 | 70 | 68 | 71 | 88 | 88 |
| Necktie | 86 | 88 | 91 | 90 | 86 | 93 | 93 | 92 |
| Young | 80 | 77 | 84 | 81 | 83 | 87 | 78 | 81 |
| Average | 81 | 79 | 85 | 83 | 83 | 87 | 86 | 84 |

We have found that very effective classifiers can be built in latent space using the dot product of an encoded vector with a computed attribute vector (Figure 11A, 11B). We have named this technique for building classifiers on latent spaces AtDot, and models trained on unsupervised data have been shown to produce strong results on CelebA across most attributes (Table 3). Importantly, these binary classifiers provide a quantitative basis to evaluate attribute vectors on a related surrogate task: their ability to be the basis for a simple linear classifier. This task is one of the most established paradigms in machine learning and provides a set of established tools such as ROC curves for offering new insights into the behavior of the attribute vectors across different hyperparameters or different models (Figure 11C).
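A dot-product classifier of this kind can be sketched as below. The scoring step follows the description above; the threshold rule (midpoint between the mean scores of the two classes on training data), the synthetic latents, and the function names are assumptions of this sketch, since the text does not state how the decision threshold is chosen.

```python
import numpy as np


def atdot_scores(latents, attribute_vec):
    """Score each encoded sample by its dot product with the attribute vector."""
    return latents @ attribute_vec


def fit_threshold(scores, labels):
    """Assumed rule: midpoint between the two classes' mean scores."""
    labels = np.asarray(labels, dtype=bool)
    return 0.5 * (scores[labels].mean() + scores[~labels].mean())


rng = np.random.default_rng(0)
dim, n = 32, 4000
labels = rng.random(n) < 0.5
latents = rng.standard_normal((n, dim))
latents[labels, 0] += 3.0  # synthetic attribute along axis 0

train, test = slice(0, n // 2), slice(n // 2, n)

# Attribute vector from the training split (difference of class means).
attribute_vec = (latents[train][labels[train]].mean(axis=0)
                 - latents[train][~labels[train]].mean(axis=0))
threshold = fit_threshold(atdot_scores(latents[train], attribute_vec),
                          labels[train])

# Classify held-out samples by thresholding their scores.
predictions = atdot_scores(latents[test], attribute_vec) > threshold
accuracy = (predictions == labels[test]).mean()
print("held-out accuracy: %.3f" % accuracy)
```

Because the classifier outputs a continuous score, sweeping the threshold instead of fixing it yields the ROC curves mentioned above.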

## 4 Future Work

Software to support most techniques presented in this paper is included in a Python software library that can be used with various generative models (https://github.com/dribnet/plat). We hope to continue to improve the library so that the techniques are applicable across a broad range of generative models. Attribute vector based classifiers also offer a promising way to evaluate the suitability of attribute vectors in domains outside of images.

We are investigating a specially constructed prior on the latent space such that interpolations could be linear. This would simplify many of the latent space operations and might enable new types of operations.

Given sufficient test data, the extent to which an encoded dataset deviates from the expected prior should be quantifiable. Developing such a metric would be useful in understanding the structure of the different latent spaces, including the probability that random samples fall outside of the expected manifold of encoded data.

#### Acknowledgments

I am thankful for the constructive feedback from readers including Ehud Ben-Reuven, Zachary Lipton, and Alex Champandard. I thank Victoria University of Wellington School of Design for supporting research on Creative Intelligence. I also thank the vibrant machine learning and creative coding communities on Twitter for their support and encouragement.

#### References

Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Lamb, Alex, Arjovsky, Martin, Mastropietro, Olivier, Courville, Aaron. Adversarially Learned Inference. https://arxiv.org/abs/1606.00704 2016

Ehrlich, Max, Shields, Timothy J., Almaev, Timur, Amer, Mohamed R. Facial Attributes Classification Using Multi-Task Representation Learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 2016.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014.

Gómez-Bombarelli, Rafael, Duvenaud, David, Hernández-Lobato, José Miguel, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. Automatic chemical design using a data-driven continuous representation of molecules. https://arxiv.org/abs/1610.02415 2016

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

Kumar, N., Belhumeur, P., and Nayar, S. Facetracer: A search engine for large collections of images with faces. In ECCV, 2008.

Lamb, Alex, Dumoulin, Vincent, Courville, Aaron. Discriminative Regularization for Generative Models. http://arxiv.org/abs/1602.03220 2016

Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, Larochelle, Hugo, Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. https://arxiv.org/abs/1512.09300 2016

LeCun, Y. and Bengio, Y. Word-level training of a handwritten word recognizer based on convolutional neural networks. In ICPR, 1994.

Li, J., and Zhang, Y. Learning surf cascade for fast and accurate object detection. In CVPR, 2013.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, Goodfellow, Ian, Frey, Brendan. Adversarial Autoencoders. http://arxiv.org/abs/1511.05644. 2016.

Mikolov, Tomas, Yih, Scott Wen-tau, Zweig, Geoffrey. Linguistic Regularities in Continuous Space Word Representations. NAACL-HLT, 2013.

Radford, Alec, Metz, Luke, Chintala, Soumith. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. Advances in Neural Information Processing Systems, 2015.

Reed, Scott E, Zhang, Yi, Zhang, Yuting, and Lee, Honglak. Deep visual analogy-making. In Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 1252–1260. Curran Associates, Inc., 2015.

Shoemake, Ken. Animating rotation with quaternion curves. In ACM Siggraph, 19(3):245–254, 1985.

Zhang, N., Paluri, M., Ranzato, M., Darrell, T., and Bourdev, L. Panda: Pose aligned networks for deep attribute modeling. In CVPR, 2014.
