Periodic Spatial Generative Adversarial Networks
This paper introduces a novel approach to texture synthesis based on generative adversarial networks (GAN) (Goodfellow et al., 2014). We extend the structure of the input noise distribution by constructing tensors with different types of dimensions. We call this technique Periodic Spatial GAN (PSGAN). The PSGAN has several novel abilities which surpass the current state of the art in texture synthesis. First, we can learn multiple textures from datasets of one or more complex large images. Second, we show that the image generation with PSGANs has properties of a texture manifold: we can smoothly interpolate between samples in the structured noise space and generate novel samples, which lie perceptually between the textures of the original dataset. In addition, we can also accurately learn periodic textures. We run multiple experiments which show that PSGANs can flexibly handle diverse texture and image data sources. Our method is highly scalable and can generate output images of arbitrarily large size.
Textures are important perceptual elements, both in the real world and in the visual arts. Many textures have random noise characteristics, formally defined as stationary, ergodic, stochastic processes (Georgiadis et al., 2013). There are many natural image examples with such properties, e.g. rice randomly spread on the ground. However, more complex textures also exist in nature, e.g. those that exhibit periodicity like a honeycomb or fish scales.
The goal of texture synthesis is to learn from a given example image a generating process that can create many images with similar properties. Classical texture synthesis methods include instance based approaches (Efros & Leung, 1999; Efros & Freeman, 2001), where pixels or patches of the source image are resampled and copied next to similar image regions, so that a seamless bigger texture image is obtained. Such methods have good visual quality and can deal with periodic images, but have a high runtime complexity when generating big images. In addition, since they do not learn an explicit model of images but just copy patches from the original pixels, they cannot be used to generate novel textures from multiple examples.
Parametric methods define an explicit model of a “good” texture by specifying some statistical properties; new texture images that are optimal w.r.t. the specified criteria are synthesized by optimization. The method of (Portilla & Simoncelli, 2000) yields good results in creating various textures, including periodic ones (the parametric statistics include phase variables of pre-specified periodicity). However, the run-time complexity is high, even for small output images. The authors also tried blending of textures, but the results were not satisfactory: patch-wise mixtures were obtained, rather than a new homogeneous texture that perceptually interpolates the originals.
More recently, deep learning methods were shown to be a powerful, fast and data-driven, parametric approach to texture synthesis. The work of (Gatys et al., 2015a) is a milestone: they showed that filters from a discriminatively trained deep neural network can be used as effective parametric image descriptors. Texture synthesis is modeled as an optimization problem. (Gatys et al., 2015b) also showed the interesting application of painting a target content photo in the style of a given input image: “neural art style transfer”. Related works speed up texture synthesis and style transfer by approximating the optimization process by feed-forward convolutional networks (Ulyanov et al., 2016; Johnson et al., 2016).
However, the choice of descriptor in all of these related works – the Gram matrix of learned filters – is a specific prior on the learnable textures for the method. It generalizes to many, but not all textures – e.g. periodic textures are reproduced inaccurately. Another limitation is that texture synthesis is performed from a single example image only, lacking the ability to represent and morph textures defined by several different images. In a related work, (Dumoulin et al., 2016) explored the blending of multiple styles by parametrically mixing their statistical descriptors. The results are interesting in terms of image stylization, but the synthesis of novel blended textures has not been shown.
Purely data driven generative models are an alternative deep learning approach to texture synthesis. Introduced in (Goodfellow et al., 2014), generative adversarial networks (GAN) train a model $G$ that learns a data distribution from example data, and a discriminator $D$ that attempts to distinguish generated from training data. The GAN architecture was further improved (Radford et al., 2015) by using deep convolutional layers with (fractional) stride. GANs have successfully created “natural” images of great perceptual quality that can fool even human observers. However, pixel resolution is usually low, and the output image size is pre-specified and fixed at training time.
For the texture synthesis use case, fully convolutional layers, which can scale to any image size, are advantageous. (Li & Wand, 2016) presented an interesting architecture, that combines ideas from GANs and the pre-trained descriptor of (Gatys et al., 2015a) in order to generate small patches with the statistics of layer activations from the VGG network. This method allows fast texture synthesis and style transfer.
Spatial GAN (SGAN) (Jetchev et al., 2016) applied for the first time fully unsupervised GANs for texture synthesis. SGANs had properties like good scalability w.r.t. speed and memory, and showed excellent results on certain texture classes, surpassing the results of (Gatys et al., 2015a). However, some classes of textures cannot be handled, and no plausible texture morphing is possible.
The current contribution, PSGAN, makes a great step forward with respect to the types of images a neural texture synthesis method can create – both periodic and non-periodic images are learned in an unsupervised way from single images or large datasets of images. Afterwards, flexible sampling in the noise space allows creating novel textures of potentially infinite output size, and smoothly transitioning between them. Figure 1 shows a few example textures generated with a PSGAN. In the next section we describe in detail the architecture of the PSGAN, and then proceed to illustrate its abilities with a number of experiments. Our source code is available at https://github.com/zalandoresearch/psgan.
In GANs, the generative model $G$ maps a noise vector $z$ to the input data space. As in SGANs (Jetchev et al., 2016), we generalize the generator $G$ to map a noise tensor $Z \in \mathbb{R}^{L \times M \times d}$ to an image $X \in \mathbb{R}^{H \times W \times 3}$, see Figure 2. The first two dimensions, $L$ and $M$, are spatial dimensions, and are blown up by the generator $G$ to the respective input spatial dimensions $H > L$ and $W > M$. The final dimension of $Z$, $d$, is the channel dimension.
In analogy to the extension of the generator $G$, we extend the discriminator $D$ to map from an input image $X$ to a two-dimensional field of spatial size $L \times M$. Each position of the resulting discriminator $D_{\lambda\mu}(X)$, $1 \le \lambda \le L$ and $1 \le \mu \le M$, responds only to a local part of $X$, which we call $D_{\lambda\mu}$’s effective receptive field. The response of $D_{\lambda\mu}(X)$ represents the estimated probability that the respective part of $X$ is real instead of being generated by $G$.
As the discriminator outputs a field, we extend the standard GAN cost function to marginalize spatially:
$$\begin{aligned} V(D, G) = {} & \frac{1}{LM} \sum_{\lambda=1}^{L} \sum_{\mu=1}^{M} \mathbb{E}_{Z \sim p_Z(Z)} \left[ \log\left(1 - D_{\lambda\mu}(G(Z))\right) \right] \\ & + \frac{1}{LM} \sum_{\lambda=1}^{L} \sum_{\mu=1}^{M} \mathbb{E}_{X' \sim p_{\mathrm{data}}(X)} \left[ \log D_{\lambda\mu}(X') \right] \end{aligned} \tag{1}$$
This function is then minimized in $G$ and maximized in $D$, $\min_G \max_D V(D, G)$. Maximizing the first line of eq. 1 in $D$ leads the discriminator to return values close to 0 (i.e. “fake”) for generated images – and, vice versa, minimization in $G$ aims at the discriminator taking large output values close to 1 (i.e. “real”). On the other hand, maximizing in $D$ the second line of eq. 1 anchors the discriminator on real data $X'$ to return values close to 1. As we want the model to be able to learn from a single image, the input image data is augmented by selecting patches $X'$ from the image(s) at random positions. To speed up convergence, in particular in the beginning of the learning process, we employ the standard GAN trick and substitute $\log(1 - D(G(Z)))$ with $-\log(D(G(Z)))$ (Goodfellow et al., 2014).
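The spatially averaged objective described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' Theano code: the function names, the `(batch, L, M)` array layout and the small `eps` added for numerical safety are all assumptions made here.

```python
import numpy as np

def spatial_gan_value(d_fake, d_real, eps=1e-12):
    """Monte-Carlo estimate of the spatially averaged GAN value V(D, G).

    d_fake, d_real: arrays of shape (batch, L, M) with discriminator
    outputs in (0, 1) for generated and real patches (assumed layout).
    """
    # Averaging over batch and both spatial axes covers the expectation
    # and the 1/(LM) double sum at once.
    fake_term = np.mean(np.log(1.0 - d_fake + eps))  # generated data
    real_term = np.mean(np.log(d_real + eps))        # real data
    return fake_term + real_term

def generator_loss_nonsaturating(d_fake, eps=1e-12):
    """Substitute generator loss: minimise -log D(G(Z))."""
    return -np.mean(np.log(d_fake + eps))
```

A discriminator that outputs 0.5 everywhere gives a value of 2 log 0.5, the usual equilibrium value of the GAN game.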
We base the design of the generator network $G$ and the discriminator network $D$ on the DCGAN model (Radford et al., 2015). Empirically, choosing $G$ and $D$ to be symmetric in their architecture (i.e. depth and channel dimensions) turned out to stabilize the learning dynamics. In particular, we chose equal sizes for the image patches $X \in \mathbb{R}^{H \times W \times 3}$ and the generated data $G(Z) \in \mathbb{R}^{H \times W \times 3}$. As a deviation from this symmetry rule, we found that removing batch normalization in the discriminator yields better results, especially on training with single images.
In contrast to the DCGAN architecture, our model contains exclusively convolutional layers. Due to the convolutional weight sharing, a network trained on small image patches can be rolled out to synthesize arbitrarily large output images after training. Upon successful training, the sampled images then match the local image statistics of the training data. Hence, the model implements a spatial stochastic process. Further, if components of $Z$ are sampled independently, the limited receptive fields of the generator imply that the generator implements a stationary, ergodic and strongly mixing stochastic process. This means that sampling of different textures is not possible – this would require a non-ergodic process. For independent sampling, learning from a set of textures results in the generation of textures combining elements of the whole set. Another limitation of independent sampling is the impossibility to align far away regions in the generated image – alignment violates translation invariance, stationarity and mixing. However, periodic textures depend on long-range correlations.
To get rid of these limitations, we extend $Z$ to be composed of three distinct parts: a local independent part $Z^l$, a spatially global part $Z^g$, and a periodic part $Z^p$. Each part has the same spatial dimensions $L$, $M$, but they may vary in their respective channel dimensions $d_l$, $d_g$, and $d_p$. Let $Z = [Z^l, Z^g, Z^p] \in \mathbb{R}^{L \times M \times d}$ be their concatenation with total channel dimension $d = d_l + d_g + d_p$. We proceed with a discussion on $Z$’s three parts.
Conceptually, the simplest approach is to sample each slice of $Z^l$ at position $\lambda$ and $\mu$, i.e. $Z^l_{\lambda\mu} \in \mathbb{R}^{d_l}$, independently from the uniform distribution, where $\lambda, \mu \in \mathbb{N}$ with $1 \le \lambda \le L$ and $1 \le \mu \le M$. As each $Z^l_{\lambda\mu}$ affects a finite region in the image, we speak of local dimensions. Intuitively, local dimensions allow the generative process to produce spatial variance and diversity by sampling from its statistical model.
For the global dimensions $Z^g$, a unique vector $z^g$ of dimensionality $d_g$ is sampled from $p(z^g)$, which is then repeated along all $L \times M$ spatial dimensions of $Z^g$, or $Z^g_{\lambda\mu i} = z^g_i$, where $1 \le \lambda \le L$, $1 \le \mu \le M$, and $1 \le i \le d_g$. Thus, $z^g$ has global impact on the whole image, and allows for the selection of the type of structure to be generated – employing global dimensions, the generative stochastic process becomes non-ergodic. Consider the task of learning from two texture images: the generator then only needs to “learn” a splitting of $\mathbb{R}^{d_g}$ in two half-spaces (e.g. by learning a hyperplane), where vectors $z^g$ from each half-space generate samples in the style of one of the two textures.
Besides the scenario of learning from a set of texture images, combination with random patch selection from a larger image (see Section 2) is particularly interesting: here, the converged generator samples textures that are consistent with the local statistics of an image. Notably, the source image does not necessarily have to be a texture, but the method will extract a texture generating stochastic process from the image, nevertheless (see Figure 5).
After learning, each vector $z^g$ represents a texture from the manifold of learned textures of the PSGAN, where $z^g$ corresponds to a generating stochastic process of a texture, not just a static image. For the purpose of image generation, $Z^g$ does not need to be composed of a single vector, but can be a smooth function in $\lambda$ and $\mu$. As long as neighboring vectors in $Z^g$ don’t vary too rapidly, the statistics of $Z^g$ are close to the statistics during training. Hence, smoothness in $Z^g$ implies a smooth texture change in $X$ (see Figure 7).
The third part of $Z$, $Z^p$, contains spatial periodic functions, or plane waves in each channel $i$:
$$Z^p_{\lambda\mu i} = \zeta_{\lambda\mu i}(K) = \sin\left( k_i^{T} \begin{pmatrix} \lambda \\ \mu \end{pmatrix} + \phi_i \right) \tag{2}$$
where $1 \le \lambda \le L$, $1 \le \mu \le M$, $1 \le i \le d_p$, and $K$ is a $2 \times d_p$ matrix which contains the wave vectors $k_i$ as its column vectors. These vectors parametrize the direction and the number of radians per spatial unit distance in the periodic channel $i$. $\phi_i$ is a random phase offset uniformly sampled from $[0, 2\pi)$, and mimics the random positional extraction of patches from the real images. Adding this periodic global tensor breaks translation invariance and stationarity of the generating process. However, it is still cyclostationary.
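The three-part noise tensor can be sketched as follows. This is an illustrative NumPy construction under stated assumptions, not the released implementation: the function name `sample_noise_tensor`, the uniform range [-1, 1], and the zero-based spatial indices (which only shift the phase of the plane waves) are choices made here, and the wave-vector matrix `K` is simply passed in.

```python
import numpy as np

def sample_noise_tensor(L, M, d_l, d_g, K, rng):
    """Build a noise tensor of shape (L, M, d_l + d_g + d_p).

    K is a (2, d_p) matrix whose columns are the wave vectors.
    """
    d_p = K.shape[1]
    # Local part: independent uniform noise at every spatial position.
    z_local = rng.uniform(-1.0, 1.0, size=(L, M, d_l))
    # Global part: one vector repeated over all spatial positions.
    z_g = rng.uniform(-1.0, 1.0, size=d_g)
    z_global = np.broadcast_to(z_g, (L, M, d_g))
    # Periodic part: sin(k_i . (lambda, mu) + phi_i) in channel i.
    lam, mu = np.meshgrid(np.arange(L), np.arange(M), indexing="ij")
    phi = rng.uniform(0.0, 2.0 * np.pi, size=d_p)
    z_periodic = np.sin(lam[..., None] * K[0] + mu[..., None] * K[1] + phi)
    return np.concatenate([z_local, z_global, z_periodic], axis=-1)
```

For example, `sample_noise_tensor(4, 6, 3, 2, K, np.random.default_rng(0))` with a `(2, 2)` matrix `K` returns an array of shape `(4, 6, 7)`, in which channels 3 and 4 are spatially constant and channels 5 and 6 are plane waves.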
While wave numbers could be set to a fixed basis, we note that a specific texture has associated wave vectors, i.e. different textures will have different axes of periodicities and scales. Hence, we make $K$ dependent on the global dimensions $z^g$ through a multi-layer perceptron (MLP), when more than one texture is learned. When only one texture is learned, i.e. $d_g = 0$, the wave numbers $K$ are direct parameters to the system. In Figure 2, we indicate this alternative dependency on $z^g$ with a dotted arrow between the MLP and $K$. All parameters of the MLP are learned end-to-end alongside the parameters of the generator $G$ and the discriminator $D$.
We base our system on the DCGAN architecture (Radford et al., 2015) with a stride of $\frac{1}{2}$ for the generator and 2 for the discriminator. Local and global noise dimensions are sampled from a uniform distribution. As in DCGAN, filters have 64 channels at the highest spatial resolution, and the number of channels is doubled after every layer that halves the spatial resolution. E.g. the 4 layer architecture has 256-128-64 channels between the noise input $Z$ and output RGB image $X$. Training was done with ADAM (Kingma & Ba, 2014) with the settings of (Radford et al., 2015) – learning rate 0.0002, minibatch size of 25. The typical image patch size was 160x160 pixels. We usually used 5 layers in $G$ and $D$ (see Table 1), kernels of size 5x5 with zero padding, and batch normalization. Such a generator upsamples the spatial noise by a factor of $2^5 = 32$ and has a receptive field size of 125. Receptive field and image patch size can both affect learning (Jetchev et al., 2016). On our hardware (Theano and Nvidia Tesla K80 GPU) we measured … seconds for the generation of a 256x256 pixel image and … seconds for a 2048x2048 pixel image.
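The receptive field of 125 quoted above, and the upsampling factor that goes with it, follow from the layer count, kernel size and stride. The sketch below uses the standard recurrence for stacked strided convolutions. The helper names are hypothetical, and it is an assumption that the quoted figure was obtained this way.

```python
def upsampling_factor(num_layers, stride=2):
    """Total spatial scale change of num_layers layers with the given stride."""
    return stride ** num_layers

def receptive_field(num_layers, kernel=5, stride=2):
    """Receptive field, in pixels, of a stack of strided convolutions."""
    field, jump = 1, 1
    for _ in range(num_layers):
        field += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride                # distance between adjacent units grows per layer
    return field

print(upsampling_factor(5))         # 32
print(receptive_field(5))           # 125
print(160 // upsampling_factor(5))  # 5
```

With 5 layers, a 160x160 training patch therefore corresponds to a 5x5 spatial noise grid.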
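The MLP for the spatially periodic dimensions has one hidden layer of dimensionality $d_h$:
$$K(z^g) = \begin{pmatrix} W_1 f(W z^g + b) + b_1 \\ W_2 f(W z^g + b) + b_2 \end{pmatrix} \in \mathbb{R}^{2 \times d_p} \tag{3}$$
$f$ is the point-wise rectified-linear unit function, and we have $W \in \mathbb{R}^{d_h \times d_g}$, $b \in \mathbb{R}^{d_h}$, $W_1, W_2 \in \mathbb{R}^{d_p \times d_h}$, and $b_1, b_2 \in \mathbb{R}^{d_p}$. We used $d_h = $ … for the experiments. All parameters are initialized from an independent random Gaussian distribution, except $b_1$ and $b_2$, which have a non-zero mean $c$. The constant vector $c \in \mathbb{R}^{d_p}$ is chosen with entries spread in the interval …. Ideally, the wave numbers $K_{1i}$ and $K_{2i}$, with $1 \le i \le d_p$, should be within the valid interval between the negative and positive Nyquist wave numbers (here $[-\pi, \pi]$). However, wave numbers of single sinusoids are projected back into this interval. Hence, no constraint is necessary. For simplicity, we write $Z^p = \zeta(K(z^g))$, or briefly $Z^p = \zeta(z^g)$, to summarize the way the periodic dimensions arise from the global ones. Alternatively, for $Z^g$ not being composed of a single vector $z^g$, we write for simplicity $Z^p = \zeta(Z^g)$ and understand this as $Z^p_{\lambda\mu} = \zeta_{\lambda\mu}(K(Z^g_{\lambda\mu}))$, where $Z^g_{\lambda\mu}$ denotes the vector slice in $Z^g$ along its last (i.e. $i$) dimension.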
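The mapping from a global vector to wave vectors can be sketched as a one-hidden-layer perceptron with a ReLU and two linear read-outs, one per row of the wave-vector matrix. The parameter names `W`, `b`, `W1`, `b1`, `W2`, `b2` and the exact functional form are assumptions made for illustration; only the single ReLU hidden layer is stated in the text.

```python
import numpy as np

def wave_vectors(z_g, W, b, W1, b1, W2, b2):
    """Map a global vector z_g of shape (d_g,) to a (2, d_p) wave-vector matrix.

    Assumed shapes: W (d_h, d_g), b (d_h,), W1 and W2 (d_p, d_h),
    b1 and b2 (d_p,).
    """
    h = np.maximum(W @ z_g + b, 0.0)             # ReLU hidden layer, shape (d_h,)
    return np.stack([W1 @ h + b1, W2 @ h + b2])  # rows: first and second components
```

When the hidden layer is inactive, the output reduces to the biases `b1` and `b2`, which is why a non-zero initial mean for them gives every texture a non-trivial starting set of wave numbers.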
The following image sources were used for the experiments in this paper: the Oxford Describable Textures Dataset (DTD) (Cimpoi et al., 2014), which is composed of various categories, each containing 120 images; the Facades dataset (Radim Tyleček, 2013), which contains 500 facades of different houses in Prague. Both datasets comprise objects of different scales and sizes. We also used satellite images of Sydney from Google Maps. The P6 and Merrigum house images are from Wikimedia Commons.
What are criteria for good texture synthesis? The way humans perceive a texture is not easily quantifiable with a statistic or metric. Still, one can qualitatively assess whether a texture synthesis method captures the right properties of a source image. In order to illustrate this, we will demonstrate how we can learn complex periodic images and texture manifolds, which allow texture blending.
First, we demonstrate learning a single periodic texture image. Figure 3 illustrates the results of PSGAN compared with SGAN (Jetchev et al., 2016), and the methods of (Gatys et al., 2015a; Efros & Freeman, 2001; Portilla & Simoncelli, 2000). The text example in the top row has a periodic and stochastic dimension. The PSGAN learns this and arranges “text” in regular lines, while varying their content horizontally. The methods of (Efros & Freeman, 2001; Portilla & Simoncelli, 2000) also manage to do this. SGAN (equivalent to a PSGAN without periodic dimensions) and Gatys’ method fail to capture the periodic structure.
The second row in Figure 3 demonstrates learning a honeycomb texture – a basic hexagonal pattern – where our method captures both the underlying periodicity and the random coloring effects inside the cells. The method of (Efros & Freeman, 2001) was inaccurate for that texture – the borders between the copied patches (60x60 pixels large) were inaccurately aligned. The other 3 methods fail to produce a hexagonal structure even locally. The last row of the figure shows the autocorrelation plots of the honeycomb textures, where the periodicity reveals itself as a regular grid superimposed onto the background, a feature only PSGAN is able to reproduce.
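An autocorrelation plot of the kind described above can be computed with the Wiener-Khinchin relation, i.e. as the inverse FFT of the power spectrum. The sketch below is an assumption about how such a plot might be produced, since the text does not say how its plots were computed. It returns the circular autocorrelation of a single-channel, non-constant image, normalised to 1 at zero lag.

```python
import numpy as np

def autocorrelation(image):
    """Normalised circular autocorrelation of a 2-D array, zero lag at the centre."""
    x = image - image.mean()
    power = np.abs(np.fft.fft2(x)) ** 2
    ac = np.fft.ifft2(power).real       # zero lag sits at index [0, 0]
    return np.fft.fftshift(ac / ac[0, 0])
```

For a periodic texture the result shows peaks close to 1 at every lattice translation, which is the regular grid visible in such plots.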
While $d_p = 2$ periodic dimensions are enough to learn the above patterns, we noticed that training convergence is faster when setting $d_p > 2$. However, for $d_p > 2$ beating of sinusoids with close wave numbers can occur, which rarely happens also for $d_p = 2$ due to sub-Nyquist artefacts (Amidror, 2015), i.e. when the texture periodicity is close to an integer fraction of the Nyquist wavenumber.
Figure 4 shows a larger slice of the learned periodic textures. In particular, Figure 4B shows that learning works for more complex patterns, here a pattern with a P6 wallpaper group symmetry (see en.wikipedia.org/wiki/Wallpaper_group; note that only translational symmetries are represented in PSGANs, not rotation and reflection symmetries), with non-orthogonal symmetry axes.
Next, we extract multiple textures from a single large image, or a set of images. The chosen images (e.g. landscape photography or satellite images) have a global structure, but also exhibit characteristics of many textures in a single image (e.g. various vegetation and houses). The structured PSGAN generator noise with global dimensions makes it possible to extract textures corresponding to different image regions.
In order to visualize the texture diversity of a model, we define a quilt array $Z^g$ that can generate different textures from a trained PSGAN model by setting rectangular spatial regions (tiles) of fixed size to the same vector $z^g$, randomly sampled from the prior. Since the generator is a convolutional network with receptive fields over several spatial elements of $Z^g$, the borders between tiles look partially aligned. For example, in Figure 1 the borders of the tiles have scaly elements across them, rather than being sharply separated (as the input is by construction).
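A quilt of this kind can be built by sampling one global vector per tile and repeating it over the tile's spatial positions. The sketch below is illustrative only: the function name, the square tiles, and the uniform prior on [-1, 1] are assumptions made here.

```python
import numpy as np

def quilt_global_tensor(n_rows, n_cols, tile, d_g, rng):
    """Global-dimension tensor of shape (n_rows * tile, n_cols * tile, d_g)."""
    # One randomly sampled global vector per tile.
    vectors = rng.uniform(-1.0, 1.0, size=(n_rows, n_cols, d_g))
    # Repeat each vector over a tile x tile block of spatial positions.
    return np.repeat(np.repeat(vectors, tile, axis=0), tile, axis=1)
```

The resulting array has sharp jumps at tile borders, and any softening of the borders in the output image comes from the generator's overlapping receptive fields.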
Figure 5 shows results when trained on a single large image. PSGAN extracts diverse bricks, grass and leaf textures. In contrast, SGAN forces the output to be a single mixing process, rather than a multitude of different visual textures. Gatys’ method also learns a single texture-like process with statistics from the whole image. (As a technical note, the whole image did not fit in memory for Gatys’ method, so we trained it only on a 1920x1920 clip-out.)
Figure 6A shows texture learning from city satellite images, a challenging image domain due to fine details of the images. Figures 6B and C show results from training on a set of multiple texture-like images from DTD.
In order to show that textures vary smoothly, we sample 4 different $z^g$ values in the four corners of a target image and then interpolate bi-linearly between them to construct the $Z^g$ tensor. Figure 7 shows that all $z^g$ values lying between the original 4 points generate proper textures as well. Hence, we speak of a learned texture manifold.
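The bilinear construction can be written directly with broadcasting. This is a minimal sketch in which the function name and the `(2, 2, d_g)` layout of the corner vectors are assumptions made here.

```python
import numpy as np

def bilinear_global_tensor(corners, L, M):
    """Interpolate four corner vectors into a tensor of shape (L, M, d_g).

    corners: array of shape (2, 2, d_g), indexed [row, column], holding the
    top-left, top-right, bottom-left and bottom-right vectors.
    """
    s = np.linspace(0.0, 1.0, L)[:, None, None]  # vertical weight
    t = np.linspace(0.0, 1.0, M)[None, :, None]  # horizontal weight
    top = (1 - t) * corners[0, 0] + t * corners[0, 1]
    bottom = (1 - t) * corners[1, 0] + t * corners[1, 1]
    return (1 - s) * top + s * bottom
```

The four corners of the result reproduce the sampled vectors exactly, and the centre holds their average.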
In this section, we explore how $Z^g$ and $Z^p$ – the global and periodic dimensions – influence the output $G(Z)$ generated from the noise tensor. Take a $Z^g$ array with quilt structure. We define $\tilde{Z}^g$ as an array of the same size as $Z^g$, where all $\tilde{Z}^g_{\lambda\mu}$ are set to the same $z^g$. We calculate two different periodic tensors: the first tensor, $Z^p = \zeta(Z^g)$, with wave numbers varying as a function of the different elements of the quilt, and the second tensor, $\tilde{Z}^p = \zeta(\tilde{Z}^g)$, with the same wave numbers everywhere.
The PSGAN is trained with minibatches for which it holds that $Z^p = \zeta(Z^g)$, but the model is flexible and produces meaningful outputs even when setting $Z^g$ and $Z^p$ to different values. Figure 9 shows that the global and periodic dimensions encode complementary aspects of the image generation process: texture identity and periodicity. The facades dataset has strong vertical and horizontal periodicity which is easily interpretable – the height of floors and window placement directly depend on these frequencies.
This disentangling leads to instructive visualizations. Figure 8 shows the generation from a tensor $Z^g$, which is constructed as a linear interpolation between two sampled $z^g$ at the left and right border. However, the wave numbers of the periodic dimensions are fixed, independently of the changing global dimensions. The figure clearly shows a change in visual appearance of the texture (controlled by the global dimensions), while preserving a consistent periodic structure (fixed by the constant wave numbers). This PSGAN disentangling property is reminiscent of the way (Chen et al., 2016) construct categorical and continuous noise variables, which explain factors of variation such as object identity and spatial transformation.
Texture synthesis from large unlabeled image datasets requires novel data-driven methods, going beyond older techniques that learn from single textures and rely on pre-specified statistical descriptors. Previous methods like SGAN are limited to stationary, ergodic and stochastic textures – even if trained on many images, SGAN fuses them and outputs a single mixing process for them. Our experiments suggest that Gatys’ method exhibits similar limitations. In contrast, PSGAN models non-ergodic cyclostationary processes, and can learn a whole texture manifold from sets of images, or from a single large image.
CGANs (Mirza & Osindero, 2014) use additional label information as input to the GAN generator and discriminator, which allows for class conditional generation. In comparison, the PSGAN also uses additional information in the generator input (the specifically designed periodic dimensions $Z^p$), but not in the discriminator. Our method remains fully unsupervised and uses only sampled noise, unlike CGANs which require specific label information.
Concerning the model architecture, the SGAN (Jetchev et al., 2016) model is similar – it can be seen as an ablated PSGAN instance with $d_g = d_p = 0$. This architecture allows great scalability (linear memory and runtime complexity w.r.t. output image pixel size) of the PSGAN when generating outputs. High resolution images can be created by splitting the $Z$ arrays into parts and rendering them sequentially, thus having a constant GPU memory footprint. Another nice property of our architecture is the ability to seamlessly stitch output texture images and get tileable textures, potentially increasing the output image size even more.
To summarize, these are the key abilities of the PSGAN:
learn textures of great variability from large images
learn periodical textures
learn whole manifolds of textures and smoothly blend between their elements, thus creating novel textures
generate images of any desired size with a fast forward pass of a convolutional neural network
linear scalability in memory and speed w.r.t. output image size.
Our method has a few limitations: convergence can sometimes be tricky, as noted for other GAN models (Radford et al., 2015); like GANs, the PSGAN can suffer from “mode dropping” – given a large set of textures it may learn only some of them, especially if the data varies in scale and periodicity. Finally, PSGANs can represent arbitrary probability distributions that extend in spatial scale to the largest periods in $Z^p$, and can generalize to periodic structures beyond that. However, images that have larger structures or more general non-periodic features are not representable: e.g. images with a global trend, or with a perspective projection, or aperiodic images, like Penrose tilings.
The PSGAN has a great potential to be adapted to further use cases. In-painting is a possible application - our method can fill random missing image regions with fitting textures. Texture style transfer – painting a target image with textures – can be done similar to the way the quilts in this paper were constructed. Further, explicit modeling with periodic dimensions in the PSGAN could be a great fit in other modalities, in particular time-series and audio data. Here, we’d expect the model to extract “sound textures”, which might be useful in synthesizing completely novel sounds by interpolating on the manifold.
On the theoretical side, to capture more symmetries of texture images, one could extend the $Z$ tensor even further, by adding dimensions with reflection or rotation symmetries. In terms of model stability and convergence, we’ll investigate alternative GAN training criteria (Metz et al., 2016; Arjovsky et al., 2017), which may alleviate the mode dropping problem.
We would like to thank Christian Bracher for his valuable feedback on the manuscript.
Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.