Spatial Generative Adversarial Networks
Generative adversarial networks (GANs) are a recent approach to train generative models of data, which have been shown to work particularly well on image data. In the current paper we introduce a new model for texture synthesis based on GAN learning. By extending the input noise distribution space from a single vector to a whole spatial tensor, we create an architecture with properties well suited to the task of texture synthesis, which we call spatial GAN (SGAN). To our knowledge, this is the first successful completely data-driven texture synthesis method based on GANs. Our method has the following features which make it a state of the art algorithm for texture synthesis: high image quality of the generated textures, very high scalability w.r.t. the output texture size, fast real-time forward generation, the ability to fuse multiple diverse source images in complex textures. To illustrate these capabilities we present multiple experiments with different classes of texture images and use cases. We also discuss some limitations of our method with respect to the types of texture images it can synthesize, and compare it to other neural techniques for texture generation.
A texture can be defined as an image containing repeating patterns with some amount of randomness. More formally, a texture is a realization of a stationary ergodic stochastic process. Our source code is available at https://github.com/zalandoresearch/spatial_gan
The goal of visual texture analysis is to infer a generating process from an example texture, which then allows generating arbitrarily many new samples of that texture, hence performing texture synthesis. Success in that task is judged primarily by visual quality and closeness to the original texture as estimated by human observers, but also by other criteria which may be application specific, e.g. the speed of analysis and synthesis, the ability to generate diverse textures of arbitrary size, and the ability to create smoothly morphing textures.
Approaches to this task fall into two broad categories. Non-parametric techniques resample pixels or whole patches from example textures, effectively randomizing the input texture in ways that preserve its visual perceptual properties. They can produce high quality images, but have two drawbacks: (i) they do not "learn" any models of the textures of interest but just reorder the input texture using local similarity, and (ii) they can be time-consuming when large textures are synthesized, because of all the search routines involved. There are methods to accelerate example-based techniques, but these require complicated algorithms.
The second category of texture synthesis methods is based on matching statistical properties or descriptors of images. Texture synthesis is then equivalent to finding an image with similar descriptors, usually by solving an optimization problem in the space of image pixels. The work of Portilla and Simoncelli is a notable example of this approach, which yields very good image quality for some textures. Carefully designed descriptors over spatial locations, orientations, and scales are used to represent statistics over target textures.
Gatys et al. present a more data-driven parametric approach that allows generation of high quality textures over a variety of natural images. Using filter correlations in different layers of convolutional networks, trained discriminatively on large natural image collections, results in a powerful technique that nicely captures expressive image statistics. However, creating a single output texture requires solving an optimization problem with iterative backpropagation, which is costly in both time and memory.
Feed-forward approaches such as TextureNet avoid this cost. Instead of doing the costly optimization of the output image pixels, they utilize powerful deep learning networks that are trained to produce images minimizing the loss. A separate network is trained for each texture of interest and can then quickly create an image with the desired statistics in one forward pass.
A generative approach to texture synthesis uses a recurrent neural network to learn the pixel probabilities and statistical dependencies of natural images. It obtains good texture quality on many image types, but the method is computationally expensive, which makes it less practical for texture generation in cases where size and speed matter.
We will present a novel class of generative parametric models for texture synthesis, using a fully convolutional architecture trained with an adversarial criterion.
The DCGAN architecture improved on the original GAN by using deep convolutional layers with (fractional) stride. Overall, GANs are powerful enough to generate natural looking images of high quality (but low pixel resolution) that can confuse even human observers.
However, in GANs the size of the output image (e.g. 64x64 pixels) is hard coded into the network architecture. This is a limitation for texture synthesis, where much larger textures and multiple sizes may be required. Laplacian pyramids have been used to generate images of increasing size , gradually adding more details to the images with stacked conditional GANs. However, that technique is still limited in the output image sizes it can handle because it needs to train GAN models with increasing complexity for each scale in the image pyramid. The scale levels must also be specified in advance, so the method cannot create output of arbitrary size.
In our work we will input images of textures (possibly of high pixel resolution) as the data distribution that the GAN must learn. However, we will modify the DCGAN architecture to allow for scalability and the ability to create any desired output texture size by employing a purely convolutional architecture without any fully connected layers. The SGAN architecture is especially well suited for texture synthesis of arbitrarily large size, a novel application of adversarial methods.
In the experiments in Section 4 we will examine these points in detail. In the next section we describe the spatial GAN architecture.
The key idea behind Generative Adversarial Networks is to simultaneously learn a generator network $G$ and a discriminator network $D$. The task of $G$ is to map a randomly sampled vector $z \in \mathbb{R}^d$ from a prior distribution $p_z(z)$ to a sample $X = G(z)$ in the image data space. The discriminator $D$ outputs a scalar representing the probability that $X$ is from real training data and not from the generator $G$. Learning is motivated from game theory: the generator $G$ tries to fool the discriminator $D$ into classifying generated data as real, while the discriminator tries to discriminate real from generated data. As both $G$ and $D$ adapt over time, $G$ generates data that gets close to the input data distribution.
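The SGAN generalizes the generator $G$ to map a tensor $Z \in \mathbb{R}^{l \times m \times d}$ to an image $X \in \mathbb{R}^{h \times w \times 3}$, see Figure 2. We call $l$ and $m$ the spatial dimensions and $d$ the number of channels. As in GANs, $Z$ is sampled from a (simple) prior distribution: $Z \sim p_Z(Z)$. We restricted our experiments to having each slice of $Z$ at position $\lambda$ and $\mu$, i.e. $z_{\lambda\mu} \in \mathbb{R}^d$, independently sampled from $p_z(z)$, where $\lambda, \mu \in \mathbb{N}$ with $1 \le \lambda \le l$ and $1 \le \mu \le m$. Note that the architecture of the GAN puts a constraint on the dimensions $l, m, h, w$: if we have a network with $k$ convolution layers with stride $\frac{1}{2}$ and same zero padding, then $\frac{h}{l} = \frac{w}{m} = 2^k$.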
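The relation between the noise tensor and the output size can be sketched in a few lines of NumPy. This is only an illustration: the function names `sample_noise` and `output_size` are hypothetical, and the uniform range of the prior is an assumption made for the example.

```python
import numpy as np

def sample_noise(l, m, d, rng):
    """Sample Z of shape (l, m, d); every slice Z[i, j] is drawn i.i.d.
    (uniform on [-1, 1] here, which is an assumption of this sketch)."""
    return rng.uniform(-1.0, 1.0, size=(l, m, d))

def output_size(l, m, k):
    """Spatial size (h, w) of the generated image for k stride-1/2 layers
    with same zero padding: each layer doubles both spatial dimensions."""
    return l * 2 ** k, m * 2 ** k

rng = np.random.default_rng(0)
Z = sample_noise(4, 6, 50, rng)
h, w = output_size(Z.shape[0], Z.shape[1], k=5)
print(h, w)  # 128 192
```

Changing `l` and `m` at deployment time changes only the shape of `Z`, not the network weights, which is what allows a larger output from the same model.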
Similarly to the way we extended the generator, the discriminator $D$ maps $X$ to a two-dimensional field of size $l \times m$ containing probabilities that indicate if an input $X$ (or $X'$) is real or generated. In order to apply the SGAN to a target texture $I$, we need to use $I$ to define the true data distribution $p_{data}$. To this end we extract rectangular patches $X' \in \mathbb{R}^{h \times w \times 3}$ from the image $I$ at random positions. We chose $X'$ to be of the same size as the samples of the generator $X = G(Z)$; otherwise GAN training failed in our experiments. For the same reason we chose symmetric architectures for $D$ and $G$, i.e. $D(G(Z))$ has the same spatial dimensions as $Z$.
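The patch extraction that defines the data distribution can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: `sample_patches` is a hypothetical helper, and a random array stands in for a real texture image.

```python
import numpy as np

def sample_patches(image, h, w, batch_size, rng):
    """Crop `batch_size` patches of size h x w at uniformly random positions."""
    H, W = image.shape[:2]
    if h > H or w > W:
        raise ValueError("patch is larger than the source image")
    tops = rng.integers(0, H - h + 1, size=batch_size)
    lefts = rng.integers(0, W - w + 1, size=batch_size)
    return np.stack([image[t:t + h, c:c + w] for t, c in zip(tops, lefts)])

rng = np.random.default_rng(0)
texture = rng.random((300, 400, 3))  # stand-in for a real texture image
batch = sample_patches(texture, 128, 128, 32, rng)
print(batch.shape)  # (32, 128, 128, 3)
```

The patch size is chosen to match the generator output size, in line with the observation above that training failed otherwise.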
Both the generator $G(Z)$ and the discriminator $D(X)$ are derived from the DCGAN architecture. In contrast to the original architecture however, spatial GANs forgo any fully connected layers: the networks are purely convolutional. This allows for manipulation of the spatial dimensions (i.e. $l$ and $m$) without any changes in the weights. Hence, a network trained to generate small images is able to generate a much larger image during deployment, which matches the local statistics of the training data.
We optimize the discriminator (and the generator) simultaneously over all spatial dimensions:
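Written out under the assumption that the standard GAN minimax objective is simply averaged over the $l \times m$ positions of the discriminator output, the criterion would take the following form. Here $Z$ is the noise tensor, $X'$ is a patch drawn from the training texture, and $D_{\lambda\mu}$ denotes the discriminator output at position $(\lambda, \mu)$; this is a sketch consistent with the surrounding definitions rather than a verbatim formula.

```latex
\min_G \max_D V(D, G) =
  \frac{1}{lm} \sum_{\lambda=1}^{l} \sum_{\mu=1}^{m}
    \mathbb{E}_{Z \sim p_Z(Z)} \left[ \log\left(1 - D_{\lambda\mu}(G(Z))\right) \right]
  + \frac{1}{lm} \sum_{\lambda=1}^{l} \sum_{\mu=1}^{m}
    \mathbb{E}_{X' \sim p_{\mathrm{data}}(X)} \left[ \log D_{\lambda\mu}(X') \right]
```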
Note that the model describes a stochastic process over the image space. In particular, as the generator is purely convolutional and each $z_{\lambda\mu}$ is identically distributed and independent of its location $\lambda$ and $\mu$, the generated data is translation-invariant. Hence the process is stationary with respect to translations.
To show that the process is also strong mixing, we first need to define the projective field (PF) of a spatial patch of $Z$ as the smallest patch of the image $X$ which contains all affected pixels of $X$ under all possible changes of that patch. In full analogy, we refer to the receptive field (RF) of a patch in $X$ as the corresponding minimal patch in $Z$ which affects it. Assume then two non-overlapping patches from the generated data, $X_a$ and $X_b$. Additionally, take their respective receptive fields in $Z$ to be non-overlapping; this can always be achieved, as receptive fields are finite but the array $Z$ can be made arbitrarily large. The data in $X_a$ and $X_b$ is then independently generated. The process is hence strong mixing (with the length scale of the projective field), which implies it is also ergodic.
For the following experiments, we used an architecture close to the DCGAN setup: convolutional layers with stride $\frac{1}{2}$ in the generator, convolutional layers with stride 2 in the discriminator, kernels of size 5x5 and zero padding. We used a uniform distribution for $p_z(z)$ with support in $[-1, 1]$. Depending on the size and structure of the texture to be learned, we used networks of different complexity, see Table 1. We used networks with identical depths in $G$ and $D$. The sizes of the filter banks of $G$ were chosen to be in reverse to those of $D$, yielding more channels for smaller spatial representations. We denote with SGANx that we have x layers depth in $G$ and $D$. We applied batch normalization on all layers, except the output layer of $G$ and the input and output layers of $D$. All network weights were initialized as 0-mean Gaussians with $\sigma = 0.02$.
We tried different sizes for the image patches $X'$. Note that the spatial dimensions of $X'$ and $Z$ are dependent, $h = l \cdot 2^k$ and $w = m \cdot 2^k$. Setting either $(l, m)$ or $(h, w)$ and adjusting the respective dependent variables worked similarly well, despite the different relative impact of the zero-padded boundaries.
Our code was implemented in Theano and tested on an Nvidia Tesla K80 GPU. The texture generation speeds of a trained SGAN with different architectures and image sizes are shown in Table 2. Generation with $G$ is very fast, as is expected for a single forward pass. Forward pass generation in TextureNet is significantly slower (20ms, according to their publication) than SGAN (5ms) for the 256 pixel resolution, despite the fact that TextureNet uses fewer filters (8 to 40 channels per convolution layer) than we do (64 to 512 filters). The simpler SGAN architecture avoids the multiple scales and join operations of TextureNet, rendering it more computationally efficient. As expected, the method of Gatys is orders of magnitude slower due to the iterative optimization required.
There are initial time costs for training the SGAN on the target textures. For optimization we used ADAM  with parameters as in  and 32 samples per minibatch. For the simple textures from Section 4.2.1, subjective assessment indicates that generated images are close to their final quality after roughly 10 minutes of training. The more complex textures of Sections 4.2.2 required around 30 minutes of training.
In terms of training time, TextureNet requires a few hours per texture. We could not compare this directly with SGAN on the same machine and textures, but we assume that SGAN trains more efficiently than TextureNet because of its simpler architecture.
The exact time and number of iterations required for training SGANs depends on the structure of the target texture. A general problem in GAN training is that it is often required to monitor the results and stop training – as occasionally overtraining may lead to degeneracy or image quality degradation.
A common way to measure quality of texture synthesis is to visually evaluate the generated samples. A setup with small input textures allows a direct comparison between SGAN and the method of Gatys. For these examples, we used an SGAN with 4 layers. Figure 3 shows our results for textures coming from a stationary and ergodic process. The textures of radishes and stones (top rows) are also mixing, while the letters (bottom row) are mixing horizontally, but not vertically. Both SGAN and the method of Gatys fail to preserve the row regularity of the source image.
Satellite images provide interesting examples of macroscopic structures with texture-like properties. City blocks in particular resemble realizations from a stationary stochastic process: different city blocks have similar visual properties (color, size). Mixing occurs on a characteristic length scale given by the major streets.
Figure 1 shows that our method works better than the method of Gatys et al. on satellite images, here concretely a single image of Barcelona. SGAN creates a city-like structure, whereas Gatys' method generates less structure and detail. We interpret this in the following way: the SGAN is trained specifically on the input image and utilizes all its model power to learn the statistics of this particular texture. The method of Gatys et al. relies on pretrained filters learned on the ImageNet dataset, which generalize well to a large set of images (but apparently not well to satellite imagery), and fails to model salient features of this city image, in particular the street grid orientation.
To indicate the superior quality of SGAN for that texture, the spatial auto-correlation (AC) of the original and synthesized textures from Figure 1 is shown in Figure 4. We calculate the AC on whole images of size 1371x647 pixels. The AC of the original and of the SGAN5 generation are similar to one another and clearly show the directions of the street grid. In contrast, the AC of Gatys' texture looks more isotropic than the original, indicating loss of visually important information.
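One common way to compute such a spatial auto-correlation is through the FFT (Wiener-Khinchin theorem). The sketch below is an assumption about how an AC map of this kind could be computed, not a description of the procedure used for Figure 4; it works on a single-channel image and treats the image as periodic. The function name `autocorrelation` is hypothetical.

```python
import numpy as np

def autocorrelation(img):
    """Normalized circular spatial autocorrelation of a 2D array via FFT."""
    x = img - img.mean()                   # remove the mean so AC measures structure
    power = np.abs(np.fft.fft2(x)) ** 2    # power spectrum
    ac = np.fft.ifft2(power).real          # inverse transform gives the AC
    ac /= ac[0, 0]                         # zero-lag value normalized to 1
    return np.fft.fftshift(ac)             # move zero lag to the array center

rng = np.random.default_rng(0)
# Synthetic test image: vertical stripes with period 8 plus a little noise.
stripes = np.sin(2 * np.pi * np.arange(64) / 8)[None, :] * np.ones((64, 1))
ac = autocorrelation(stripes + 0.1 * rng.standard_normal((64, 64)))
print(ac.shape)
```

For an oriented structure such as a street grid, the AC map shows ridges along the dominant directions, whereas an isotropic texture gives a roughly circular blob around the center.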
Figure 5 illustrates the effects of different network depths on the SGAN generated outputs. More layers and larger receptive fields as in SGAN6 allow larger structures to be learned and longer streets emerge, i.e., there is less mixing and more regularity at a given scale.
The GAN approach to texture synthesis can combine multiple example texture images in a natural way. We experimented with the flowers dataset (www.robots.ox.ac.uk/~vgg/data/flowers/) containing 8189 images of various flowers, see Figure 6 (a) for examples. We resized each image to 160 pixels along one dimension, while rescaling the other dimension to preserve the aspect ratio of the original image. Then we trained an SGAN with 5 layers; each minibatch for training contained 128x128 pixel patches extracted at random positions from randomly selected input flower images. This is an example of a non-ergodic stochastic process since the input images are quite different from one another.
A sample from the generated composite texture is shown in Figure 6 (b). The algorithm generates a variety of natural looking flowers but cannot blend them smoothly since it was trained on a dataset of single flower images. Still, the final result looks aesthetic and such fusion of large image datasets for texture learning has great potential for photo mosaic applications.
In another experiment we learned a texture representing the 5 satellite images shown on Figure 6 (c). The input images depict areas in the Old City of Amsterdam with different prevailing orientations. Figure 6 (d) demonstrates how the generated texture has orientations from all 5 inputs. Although we did not use any data augmentation for training, the angled segments join smoothly and the model learns spatial transitions between the different input textures. Overall, the Amsterdam city segments come from a more ergodic process than the flowers example, but less ergodic than the Barcelona example.
GAN based methods can fuse several images naturally, as they are generative models that capture the statistics of the image patches they are trained with. In contrast, methods with specified image statistical descriptors [12, 5] generate textures that match a single target image closely in these descriptors. Extending these methods to several images is not straightforward.
The spatial dimensions of $Z$ are locally independent: output image pixels depend only on a subset of the input noise tensor $Z$. This property of the SGAN allows two practical tricks for creation of output textures with special properties. Below we illustrate these tricks briefly, see Appendix I for details.
Seamless textures are important in computer graphics, because arbitrarily large surfaces can be covered by them. Suppose we want to synthesize a seamless texture of desired size $h \times w$, and generating it would require spatial dimensions of $l \times m$ for a given SGAN model. Let $r = \frac{h}{l} = \frac{w}{m}$ be the ratio of the spatial dimensions of $X$ to $Z$. We use Python slicing notation, where for example in $Z[-4:, 4:]$ the '-4:' indicates the last 4 indices before the end of the array in the 1st dimension, the '4:' indicates all elements but the first 4 along the second dimension, and all elements are taken along the last, not explicitly indexed, dimension. We should sample a slightly bigger noise tensor $Z \in \mathbb{R}^{(l+4) \times (m+4) \times d}$ and set its edges to repeat: $Z[-4:, :] = Z[:4, :]$ and $Z[:, -4:] = Z[:, :4]$. Then we can calculate $X = G(Z)$ and crop $2r$ pixels from each border, resulting in an image of size $h \times w$ that can be tiled in a rectangular grid as shown in Figure 7.
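The seamless-tile procedure can be illustrated with a small NumPy sketch. Everything here is a stand-in chosen for the example: `toy_generator` is a hypothetical, untrained, purely convolutional generator (separable 5-tap kernels, stride 1/2, output exactly twice the input size per layer), and the crop width of $2 \cdot 2^k$ pixels assumes this specific layer alignment.

```python
import numpy as np

def up1d(x, w, axis):
    """Stride-1/2 (transposed) 1D convolution with 5 taps along `axis`:
    out[2*i + t - 2] += w[t] * x[i], cropped to length 2*n."""
    x = np.moveaxis(x, axis, 0)
    n = x.shape[0]
    buf = np.zeros((2 * n + 4,) + x.shape[1:])
    for t in range(5):
        buf[t:t + 2 * n:2] += w[t] * x
    return np.moveaxis(buf[2:2 * n + 2], 0, axis)

def toy_generator(Z, kernels, mix):
    """Toy stand-in for G: 1x1 channel mixing, then one separable
    upsampling layer with tanh per kernel. Not the trained SGAN."""
    x = Z @ mix                       # (l, m, d) -> (l, m)
    for w in kernels:
        x = np.tanh(up1d(up1d(x, w, 0), w, 1))
    return x

def seamless_tile(Z, generator, k):
    """Make the last 4 rows/columns of Z repeat the first 4, generate,
    and crop 2 * 2**k pixels from each border."""
    Z = Z.copy()
    Z[-4:, :] = Z[:4, :]
    Z[:, -4:] = Z[:, :4]
    c = 2 * 2 ** k
    return generator(Z)[c:-c, c:-c]

rng = np.random.default_rng(0)
k, l, m, d = 3, 6, 5, 8
kernels = [rng.standard_normal(5) for _ in range(k)]
mix = rng.standard_normal(d)

def gen(Z):
    return toy_generator(Z, kernels, mix)

Z0 = rng.uniform(-1, 1, (l + 4, m + 4, d))
tile = seamless_tile(Z0, gen, k)
print(tile.shape)  # (48, 40)
```

Placing copies of `tile` next to each other with `np.tile(tile, (2, 2))` should then give the same image as generating directly from a noise tensor that is periodic over two periods, which is what makes the borders seamless.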
In addition, we can use the SGAN generator to create textures in a memory efficient way by splitting the calculation of $G(Z)$ into independent chunks that use less memory. This approach allows for straightforward parallelization. A potential application would be real-time 3D engines, where the SGANs could produce the currently visible parts of arbitrarily large textures.
Suppose again that we have $X = G(Z)$ with $Z \in \mathbb{R}^{l \times m \times d}$. Let us split $Z$ in two along dimension $l$. We can call the generator twice, producing $X^1 = G(Z[:l/2+2])$ and $X^2 = G(Z[l/2-2:])$, with each call using approximately half the memory of a call to the whole $G(Z)$. To create the desired large output, we concatenate the two partially generated images, cropping their edges by $c = 2 \cdot 2^k$ pixels: $X = \mathrm{concat}(X^1[:-c], X^2[c:])$. With this approach, the only limitation is the number of pixels that can be stored, while the memory footprint in the GPU is constant. Appendix I has a precise analysis of the procedure.
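The chunked generation can be checked numerically on a toy model. The sketch below is an illustration under stated assumptions, not the SGAN implementation: `toy_generator` is a hypothetical, untrained convolutional stand-in without batch normalization, and the overlap of 2 noise elements and the crop of $2 \cdot 2^k$ pixels assume 5-tap kernels with stride 1/2.

```python
import numpy as np

def up1d(x, w, axis):
    """Stride-1/2 (transposed) 1D convolution with 5 taps along `axis`:
    out[2*i + t - 2] += w[t] * x[i], cropped to length 2*n."""
    x = np.moveaxis(x, axis, 0)
    n = x.shape[0]
    buf = np.zeros((2 * n + 4,) + x.shape[1:])
    for t in range(5):
        buf[t:t + 2 * n:2] += w[t] * x
    return np.moveaxis(buf[2:2 * n + 2], 0, axis)

def toy_generator(Z, kernels, mix):
    """Toy stand-in for G (channel mixing + separable upsampling layers)."""
    x = Z @ mix
    for w in kernels:
        x = np.tanh(up1d(up1d(x, w, 0), w, 1))
    return x

def generate_in_two_chunks(Z, generator, k):
    """Compute G(Z) from two halves of Z that overlap by 2 elements on each
    side of the cut, cropping 2 * 2**k pixels at the cut before joining."""
    a = Z.shape[0] // 2
    c = 2 * 2 ** k
    first = generator(Z[:a + 2])[:-c]
    second = generator(Z[a - 2:])[c:]
    return np.concatenate([first, second], axis=0)

rng = np.random.default_rng(1)
k = 3
kernels = [rng.standard_normal(5) for _ in range(k)]
mix = rng.standard_normal(4)

def gen(Z):
    return toy_generator(Z, kernels, mix)

Z = rng.uniform(-1, 1, (8, 3, 4))
full = gen(Z)
chunked = generate_in_two_chunks(Z, gen, k)
print(full.shape, np.allclose(full, chunked))
```

The same idea extends to more than two chunks and to both spatial dimensions, so that each call to the generator only ever sees a small slice of the noise tensor.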
Our SGAN synthesizes textures by learning to generate locally consistent image patches, thus making use of the repeating structures present in most textures. The mixing length scale of the generation depends on the projective field sizes. Choosing the best architecture depends on the specific texture image. This holds also for the algorithm of Gatys et al., where the parametric expressiveness of the model depends on the set of network layers used for the descriptive statistics calculation.
Rather than using handcrafted features or other priors to specify the desired properties of the output, the adversarial framework learns the relevant statistics only from information contained in the training texture. In contrast, parametric methods specify statistical descriptors a priori, before seeing the image. This includes models using the wavelet transform or the properties of the filters of a neural network trained on a large image dataset. Such models can generalize to many textures, but are not universal: sometimes it is better to train a generative model for a single texture, as our examples from Section 4.2.2 show.
TextureNet is an interesting case because it describes a generative model that takes noise vectors as inputs and produces images with desired statistics. This is analogous to the generator $G$ in the GAN. However, features extracted from pre-trained discriminative networks (as in the method of Gatys et al.) play the role of a discriminator function, in contrast to learned discriminators in adversarial frameworks.
The examples in Sections 4.2.1 and 4.2.2 show that our method deals well with texture images, corresponding to realizations of stationary ergodic stochastic processes that mix. It is not possible to learn statistical dependencies that exceed the projective field size with SGANs.
Synthesizing regular non-mixing textures (e.g. a chess grid pattern) is problematic for some methods . The SGAN cannot learn such cases - distant pixels in the output image are independent from one another because of the strong mixing property of the SGAN generation, as discussed in Section 3. For example, the model learns basic letter shapes from the text image in Figure 3 (bottom row), but fails to align the "letters" into globally straight rows of text. The same example shows that the approach of Gatys  has similar problems synthesizing regular textures. In contrast, the parametric method of  and non-parametric instance-based methods work well with regular patterns.
To summarize the capabilities of our method:
real-time generation of high quality textures with a single forward pass
generation of images of any desired size
processing time requirements scale linearly with the number of output pixels
combination of separate source images into complex textures
seamless texture tiles
As next steps, we would like to examine modifications allowing the learning of images with longer spatial correlations and strong regularity. Conditional GAN  architectures can help with that, as well as allow for more precise control over the generated content. Conditioning can be used to train a single network for a variety of textures simultaneously, and then blend them in novel textures.
In future work, we plan to examine further applications of the SGAN such as style transfer, mosaic rendering, image completion, in-painting and modeling 3D data. In particular, 3D data offers intriguing possibilities in combination with texture modeling – e.g. in machine generated fashion designs. The SGAN approach can also be applied to audio data represented as time series, and we can experiment with novel methods for audio waveform generation.
We would like to thank Christian Bracher, Sebastian Heinz and Calvin Seward for their valuable feedback on the manuscript.
Proceedings of the International Conference on Computer Vision, 1999.
Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, 2015.
Proceedings of the 32nd International Conference on Machine Learning, 2015.
Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
We examine in detail the projective fields (PF) and which inputs from $Z$ map to which output pixels in $X$. Let $[i, j)$ indicate a range starting at index $i$ inclusive and ending at index $j$ exclusive, which we will also call left and right border. For convenience of notation, we will write only 1D indices for the square PFs, e.g. $[i, j)$ will refer to the square field $X[i:j, i:j]$ in Python slicing notation. For simplicity, we express the formulas valid only for the architecture we usually used for SGAN, 5x5 kernels and convolutional layers with stride $\frac{1}{2}$.
We start by examining the recursive relation between input and output of a fractionally strided convolutional layer.
An input $[i, j)$ has as its PF an output $[i', j')$ after applying one convolutional layer. It holds that $i' = 2i - 2$ and $j' = 2j + 1$.
This relation holds because of the way we implement a convolutional layer with stride $\frac{1}{2}$ in Theano, just like DCGAN does. Note that $j' - i' = 2(j - i) + 3$, which is exactly relationship 13 from  between input and output sizes of a transposed convolution.
We can rewrite the recursive relations as a function of the initial size and count of layers $k$:
An input $[i, j)$ has as its PF an output $[i^k, j^k)$ after $k$ convolutional layers. It holds that $i^k = 2^k i - 2^{k+1} + 2$ and $j^k = 2^k j + 2^k - 1$.
In particular, we get the PF size of a single $z_{\lambda\mu}$, i.e. for $j = i + 1$, as $j^k - i^k = 2^{k+2} - 3$, which we denote in Table 1 as PF/RF.
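The recursion and its closed form can be cross-checked numerically. The sketch below assumes a 5-tap kernel with stride 1/2 where input index $i$ writes to output indices $2i-2, \dots, 2i+2$; the function names are hypothetical and the alignment is an assumption of this example.

```python
def pf_one_layer(i, j):
    """PF [i', j') of the input range [i, j) after one stride-1/2 layer."""
    return 2 * i - 2, 2 * j + 1

def pf_after_layers(i, j, k):
    """Closed-form PF borders after k layers."""
    return 2 ** k * i - 2 ** (k + 1) + 2, 2 ** k * j + 2 ** k - 1

def pf_size(k):
    """PF size of a single noise element (input range [0, 1))."""
    left, right = pf_after_layers(0, 1, k)
    return right - left

for k in (4, 5, 6):
    lo, hi = 3, 7
    for _ in range(k):          # apply the one-layer recursion k times
        lo, hi = pf_one_layer(lo, hi)
    assert (lo, hi) == pf_after_layers(3, 7, k)
    print(k, pf_size(k))
# 4 61
# 5 125
# 6 253
```

Under these assumptions the PF size grows as $2^{k+2} - 3$, so each additional layer roughly doubles the length scale over which the generated texture can be correlated.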
With these relations, we can show why we can split, without any loss of information, the calculation of a big array $X = G(Z)$ into two smaller volumes as in Section 4.3.2.
Let $X = G(Z)$ with $Z \in \mathbb{R}^{l \times m \times d}$, $X$ of size $h = 2^k l$ along the split dimension, and $l$ even. We can define $X^1 = G(Z[:l/2+2])$ and $X^2 = G(Z[l/2-2:])$. Then it holds that $X = \mathrm{concat}(X^1[:-c], X^2[c:])$ where $c = 2 \cdot 2^k$.
Since we split only on one of the spatial dimensions, it is enough to reason only for intervals in that dimension and not the whole 2D field. Recall that after $k$ layers the PF of a single element $Z[i]$ has left border $2^k i - 2^{k+1} + 2$ and right border $2^k i + 2^{k+1} - 1$. Image $X^1[:-c]$ in the Python slicing notation is the same as image $X^1[:h/2]$, and its rightmost pixel has index $h/2 - 1$. The PF of $Z[l/2+2]$ has a left border $2^k (l/2 + 2) - 2^{k+1} + 2 = h/2 + 2$. The calculation of the pixel $X[h/2-1]$ is not influenced by any further elements from $Z$, i.e. for any $i \ge l/2 + 2$ it holds that $X[h/2-1] \notin \mathrm{PF}(Z[i])$. This means that $X^1[:-c]$ is exactly equal to $X[:h/2]$, the left half of the desired image $X$. By a similar argument, the left-most pixel of $X^2[c:]$ is not influenced by $Z[:l/2-2]$, and $X^2[c:]$ is equal to $X[h/2:]$, the right half of the desired image $X$. This proves that $X = \mathrm{concat}(X^1[:-c], X^2[c:])$.
This proof shows why an overlap of 2 spatial dimensions of $Z$ is sufficient for splitting. Is it also necessary? The answer is yes, since $Z[:l/2+2]$ is the smallest set required to calculate $X[:h/2]$: the PF of $Z[l/2+1]$ has left border $h/2 - 2^k + 2$ and right border $h/2 + 3 \cdot 2^k - 1$, so for $k \ge 2$ the pixel $X[h/2-1]$ is inside the PF of $Z[l/2+1]$.
A similar proof can be made for the seamless, i.e. periodic, texture case from Section 4.3.1, and we sketch it here. Making $Z$ periodic in its spatial dimensions makes the output $X$ periodic as well. As in the previous case, we need for each border an overlap of 2, hence we need to make 4 elements along each dimension identical, i.e. we set $Z[-4:, :] = Z[:4, :]$ and $Z[:, -4:] = Z[:, :4]$. More of the periodic structure in $Z$ is not needed as it would be redundant. The output image is $X[c:-c, c:-c]$ with $c = 2 \cdot 2^k$. It is easily shown that the leftmost and rightmost pixels $X[c]$ and $X[-c-1]$ fit together as required for a seamless texture. These pixels use information from elements of $Z$ that are equal numerically, $Z[-4:] = Z[:4]$. The relative position of the pixel $X[-c-1]$ inside the PF of the volume $Z[-4:]$ is offset exactly by 1 pixel compared to the position of $X[c]$ inside the PF of $Z[:4]$.