1.1 Mosaics and textures
Mosaics are a classical art form. The Romans were masters in skillfully selecting small colored stones to make beautiful mosaics of large scenes. Later, the Renaissance painter Arcimboldo composed various objects in his paintings to make amazing portraits of people. In general, mosaics work because the human visual system averages colors over spatial regions: when looking from a distance the large image emerges, but when looking closely the details of the single tiles emerge. In modern times, computer graphics algorithms have enabled different forms of digital image mosaics photomosaic ; JIM. However, these methods use distinct non-overlapping small tiles to paint the large image. A seamless mosaic style like Arcimboldo’s – where the whole image acts as a mosaic, without any tiles with borders – is visually closer to modern methods of texture synthesis and transfer.
Image quilting EfrosQ recombines patches from the original textures in order to smoothly reconstruct a target image, a process called “texture transfer”. However, a disadvantage is the high runtime complexity when generating large images. In addition, since such instance-based models merely copy the original pixels, they cannot generalize and create novel textures from multiple examples.
The work of GatysEB15a uses discriminatively trained deep neural networks as effective parametric image descriptors, allowing both texture synthesis and a novel form of texture transfer called “neural art style transfer”. However, texture synthesis and transfer are performed from a single example image, and the method lacks the ability to represent and morph textures defined by several different images.
Spatial versions of Generative Adversarial Networks (GAN) are well suited to unsupervised learning of textures SGAN2016 ; PSGAN2017. The Periodic Spatial GAN (PSGAN) allows high quality texture synthesis with low memory use and fast runtime. It can also leverage information from many input images and use them to learn a texture manifold, a rich distribution over many textures that allows morphing into novel textures. Such generative models can give more widely varied outputs than the instance-based and neural descriptor based approaches to texture synthesis. Such variety is key to our proposed method.
1.2 Introducing a new algorithm for mosaic creation: GANosaic
Our novel GANosaic method has two steps. First, a PSGAN is trained on a set of example images; for details see Section 2. Second, the generator $G$ of the PSGAN is used as a module in an optimization problem: generate an image $G(Z)$ that is as close as possible to a “content” image, while staying on the learned texture manifold, the “style”. One way to do this is to adapt the input noise tensor $Z$ such that the output image $G(Z)$ from the generator is as close as possible to the target content image. This is done by defining the distance between $G(Z)$ and the content image as a loss function and then optimizing it w.r.t. $Z$. However, during PSGAN training $Z$ came from a prior noise distribution, and the mosaic optimization can push $Z$ to values far away from the prior distribution and lead to degenerate looking textures.
For aesthetically pleasing mosaics we want to stay as close as possible to the texture manifold “style”. When we use a PSGAN texture model, the mosaic style is better represented with a texture loss term that ensures that the input tensor $Z$ stays close to the prior statistics used during PSGAN training. (In our notation, $Z$ denotes both the noise random variable and a tensor sampled from it.) Concretely, we model the loss to ensure that the local channels keep their statistical independence. The total loss function is therefore composed of two parts, a content loss and a texture loss:
The content loss term ensures reconstruction of the content image by the generator output $G(Z)$:
Here the loss is the mean of all squared elements of the difference tensor. The mapping applied to both images is the “correspondence map” EfrosQ, which specifies what perceptual distance metric we want to use w.r.t. the content image. This can be a simple predefined image transformation, or a more complex approach, e.g. utilizing the outputs of pretrained convolutional filters GatysEB15a. By using some image downscaling operator in the correspondence map (e.g. pooling layers) we split the frequencies of the resultant image: the low frequencies are determined by the content image, and the high frequencies come from the texture manifold. Such a split improves the mosaic quality; see Section 4.2 for ablation studies regarding the effects of the choice of correspondence map.
The texture loss is also required to keep the optimized noise tensor $Z$ close to the manifold of textures created by the prior noise distribution. It regularizes the local noise channels of $Z$. See Section 3 for more details on the loss and its effect on mosaic output.
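As a sketch, the two-part loss described above might be written as follows. The notation is assumed here and is not necessarily the authors' own: $\mathcal{L}_c$ and $\mathcal{L}_t$ name the content and texture losses, $\lambda$ is a hypothetical weight on the texture term, $\phi$ stands for the correspondence map, $I$ for the content image, and $\operatorname{mean}$ averages over all tensor elements.

```latex
\begin{align}
\mathcal{L}(Z) &= \mathcal{L}_c(Z) + \lambda\,\mathcal{L}_t(Z), \\
\mathcal{L}_c(Z) &= \operatorname{mean}\!\left[\big(\phi(G(Z)) - \phi(I)\big)^{2}\right].
\end{align}
```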
In summary, the GANosaic is a powerful novel method to generate art, with the following key properties:
- generation of seamless mosaics with unique texture visual aesthetics
- flexible differentiable texture model PSGAN2017 that learns and morphs diverse texture images
- very large scalability with respect to output mosaic size – all calls to the generator can be efficiently split into small tensor chunks seamlessly forming a very large final image SGAN2016
- fast optimization in noise space by gradient descent
- exploration of multiple different mosaics for given texture and content image
2 Background: texture generation via PSGAN
In a GAN, a generator network $G$ learns to distort a noise vector $z$, which is sampled from a standard distribution (e.g. uniform), such that the distorted probability distribution is close to the distribution observed through the training samples. This is achieved with a game-theoretic idea, by letting the generator network play against an additional network, the discriminator $D$: the task of the discriminator is to classify a sample as being from the generator or from the training set, while the generator tries to be as good as possible in producing samples that get classified by the discriminator as real training data.
The extensions of PSGAN beyond the standard GAN framework are threefold. First, as in spatial GANs SGAN2016, the architecture is chosen to be a fully convolutional version of DCGAN RadfordMC15 and the noise vector is extended to a spatial tensor $Z$. Its first two dimensions are spatial, while the third is the channel dimension. Hence, akin to DCGAN, the fractionally strided convolutions in PSGANs upsample the spatial dimensions of $Z$ to the output image dimensions. In our case we typically use 5 convolution layers, each with a fractional stride of $1/2$, hence the total upsampling factor in our case is $2^5 = 32$.
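The relation between noise and image resolution can be sketched with a few lines of arithmetic. This is a minimal illustration, assuming each of the 5 layers doubles the spatial size, which is consistent with the later example of a 1024x1024 image generated from 32x32 spatial positions; the function names are hypothetical.

```python
def upsampling_factor(num_layers: int = 5, per_layer: int = 2) -> int:
    """Total spatial upsampling of the generator (assumed: doubling per layer)."""
    return per_layer ** num_layers


def noise_spatial_size(image_size: int, num_layers: int = 5) -> int:
    """Spatial extent of the noise tensor needed for a given output size."""
    factor = upsampling_factor(num_layers)
    if image_size % factor != 0:
        raise ValueError("image size must be a multiple of the upsampling factor")
    return image_size // factor


def image_size_from_noise(noise_size: int, num_layers: int = 5) -> int:
    """Output image extent produced from a noise tensor of the given extent."""
    return noise_size * upsampling_factor(num_layers)
```

For example, `noise_spatial_size(1024)` gives the 32 spatial positions per side mentioned in the discussion.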
As the discriminator $D$ is chosen symmetrically to $G$, in particular also fully convolutional, the standard GAN cost function needs to be marginalized over the spatial classifications:
Here the loss is averaged over the discriminator outputs at all spatial locations. The key advantage of this approach is that the image patches used for the training minibatches can have a different size than the image outputs used when sampling from the model, yielding arbitrarily large output resolution. However, as the receptive field of a single location in the tensor $Z$ is spatially limited in the output, far away regions are independently sampled. The local statistics must therefore be independent of the position; in other words, only sampling of a single texture is possible. To overcome this limitation, as a second extension, a fraction of channels in $Z$ are spatially shared to allow for conditioning on a global structure. The final extension is to implement a spatial basis in parts of $Z$, which can be used to anchor image generation. It has been shown in PSGAN2017 that a plane wave parameterization for the spatial basis allows the generation of periodic textures, and can also lead to better quality of non-periodic textures. The wave numbers of the plane waves in PSGAN are given as functions of the global channels by a multi-layered perceptron, which is learned end-to-end alongside $G$ and $D$. In total the tensor $Z$ consists of three parts: a local part $Z^l$, a global part $Z^g$, and a periodic part $Z^p$, concatenated in the channel dimension.
After learning, the spatially shared global channels define a texture to be sampled, while the independent samples in the local dimensions give rise to local pattern variation. Importantly, when the global channels are allowed to change smoothly in the spatial dimensions, this yields a spatial transitioning of textures, while the output still looks like a plausible texture locally. Hence we speak of a texture manifold.
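The three-part structure of the noise tensor can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: the uniform range $[-1, 1]$ is assumed, and the wave numbers are passed in as fixed hypothetical values, whereas in PSGAN they are produced by a learned multi-layered perceptron from the global channels.

```python
import numpy as np


def sample_psgan_noise(rows, cols, d_local, d_global, wave_numbers, rng):
    """Build a noise tensor with local, global and periodic channels."""
    # Local part: independent at every spatial position and channel.
    z_local = rng.uniform(-1.0, 1.0, size=(rows, cols, d_local))

    # Global part: one vector shared by all spatial positions.
    g = rng.uniform(-1.0, 1.0, size=(d_global,))
    z_global = np.broadcast_to(g, (rows, cols, d_global)).copy()

    # Periodic part: plane waves sin(k1 * row + k2 * col + phase).
    k = np.asarray(wave_numbers, dtype=float)          # shape (d_periodic, 2)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(k.shape[0],))
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    z_periodic = np.sin(r[..., None] * k[:, 0] + c[..., None] * k[:, 1] + phase)

    # Concatenate along the channel dimension.
    return np.concatenate([z_local, z_global, z_periodic], axis=-1)
```

Replacing the shared vector `g` by a field that varies smoothly over space would correspond to the texture transitions described above.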
Figure 3 shows how the texture manifolds learned by PSGAN look. For PSGAN training we use Sydney satellite images from Google Maps, stone wall images from Wikimedia Commons, and the DTD “scaly” images from cimpoi14describing.
Figure 3: Examples of the PSGAN texture models used for our mosaics. Morph plots show the ability of texture manifolds to smoothly change texture processes. The plots (size 960x960 pixels) were created by bilinearly interpolating the global noise tensor between 4 random texture samples.
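A morph plot of this kind can be sketched by bilinear interpolation between four global vectors placed at the corners of the noise grid. This is an illustrative sketch, not the authors' code; the corner ordering and the function name are assumptions.

```python
import numpy as np


def bilinear_morph(corners, height, width):
    """Interpolate 4 global vectors over a (height, width) grid.

    `corners` has shape (4, d), ordered top-left, top-right,
    bottom-left, bottom-right (an assumed convention).
    """
    tl, tr, bl, br = np.asarray(corners, dtype=float)
    u = np.linspace(0.0, 1.0, height)[:, None, None]   # vertical weight
    v = np.linspace(0.0, 1.0, width)[None, :, None]    # horizontal weight
    top = (1.0 - v) * tl + v * tr
    bottom = (1.0 - v) * bl + v * br
    return (1.0 - u) * top + u * bottom                # shape (height, width, d)
```

Under the assumed 32x upsampling, a 30x30 grid of global vectors would correspond to a 960x960 pixel plot.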
Technical note: fixed constant batch normalization in $G$ for GANosaic
The generator $G$ of PSGAN was trained using batch normalization after every deconvolutional layer. Batch normalization calculates per-layer statistics that capture the distribution of feature activations in the minibatch input to $G$. However, the GANosaic method (optimization of $Z$ w.r.t. a content and texture loss, where $Z$ is a tensor with 1 instance of a large spatial extent) is different from the PSGAN training (optimization of the parameters of $G$ w.r.t. the GAN loss function, where $Z$ has many instances of spatially small arrays from the noise prior distribution). We found empirically that GANosaic works better with fixed statistics (mean and variance) for the batch normalization operations of the trained $G$. Concretely, we pre-calculate the batch normalization statistics of a batch with many instances of $Z$ sampled from the prior. Afterwards, these statistics are used as a constant rescaling for each batch normalization operation inside the network $G$. This also makes the practical implementation of splitting procedures SGAN2016 easier for very large mosaics, a key ability of GANosaic.
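The idea of fixed statistics can be sketched as follows for a single layer, assuming activations in a channels-last layout; the function names and the `gamma`, `beta`, `eps` parameters are illustrative and not taken from the paper.

```python
import numpy as np


def precompute_bn_stats(activations):
    """Per-channel mean and variance from a batch of prior samples.

    `activations` has shape (batch, height, width, channels).
    """
    mean = activations.mean(axis=(0, 1, 2))
    var = activations.var(axis=(0, 1, 2))
    return mean, var


def fixed_batch_norm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize with constant statistics instead of those of the current input."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```

Because the statistics no longer depend on the input, normalizing a spatial chunk gives the same values as normalizing the full tensor and cutting out the chunk, which is what makes splitting straightforward.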
3 Texture loss
Optimization of $Z$ w.r.t. the content loss can introduce spatial correlations between the local dimensions. During training of the texture model, however, the local dimensions $Z^l$ in the PSGAN model were independently sampled from the prior at every spatial position and channel dimension. Hence, the correlations imply a move away from the learned texture manifold. To remedy this, we introduce a regularization term: the key idea is that samples taken from the joint distribution of neighboring local dimensions should be distributed according to the prior distribution used during training (up to finite sample size effects). In our case, as the prior is an independent uniform distribution, this means the samples should fill up the whole hypercube. In contrast, if local dimensions were perfectly correlated, the samples would lie exclusively on the diagonal of the hypercube.
To implement this idea we assume independence in the channel dimension and consider the different channels as the samples. By employing a kernel density estimate, the joint distribution can be estimated and compared to the prior distribution. For practical reasons, only pairwise neighboring positions in $Z^l$ are evaluated. The restriction to neighboring positions can be justified by noting that correlations in natural images fall off monotonically with distance hyvarinen2009natural. The computational benefit is a reduction from quadratic to linear time complexity. Formally, we can write the texture loss term as:
where the distance function measures the distance between two probability distributions, and the square brackets denote the concatenation of column vectors to a matrix. The set of spatial offsets determines for which neighboring positions the distribution is regularized.
The kernel density estimate, given a set of samples and evaluated at a point, is the average of a kernel function centered on each sample. Any valid kernel function can be used; we employed a Gaussian kernel. From this form we see that the target probability distribution of the regularizer is the convolution of the prior probability distribution with the Gaussian kernel.
Finally, the distance function between the probability distributions needs to be defined. We simply calculate the distance as the squared difference between the two distributions evaluated on a set of grid points equally spaced in the unit cube. This makes the regularizer a differentiable function w.r.t. $Z$. Figure 4 gives a toy example of the behavior of the regularizer. Note that the regularizer is similar to determinantal point processes kulesza2012determinantal; in particular, the resulting samples tend to be too regular in comparison to samples from the prior.
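The regularizer can be sketched in NumPy as below. This is a minimal sketch under several assumptions that are not stated in the text: a uniform prior on $[-1, 1]$, offsets to the right and lower neighbor, a bandwidth of 0.3 and an 8x8 evaluation grid. It only evaluates the loss; an actual implementation would use a framework with automatic differentiation.

```python
import math

import numpy as np


def gaussian_kde(samples, grid, bandwidth):
    """Gaussian KDE of 2-D samples (n, 2), evaluated at grid points (m, 2)."""
    diff = grid[:, None, :] - samples[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2) / (2.0 * np.pi * bandwidth ** 2)
    return kernel.mean(axis=1)


def smoothed_uniform(grid, bandwidth, low=-1.0, high=1.0):
    """Uniform density on [low, high]^2 convolved with the Gaussian kernel."""
    erf = np.vectorize(math.erf)
    scale = bandwidth * math.sqrt(2.0)
    per_axis = (erf((grid - low) / scale) - erf((grid - high) / scale)) / (2.0 * (high - low))
    return per_axis.prod(axis=1)


def texture_loss(z_local, offsets=((0, 1), (1, 0)), bandwidth=0.3, n_grid=8):
    """Mean squared difference between pairwise KDEs and the smoothed prior."""
    rows, cols, _ = z_local.shape
    axis = np.linspace(-1.0, 1.0, n_grid)
    grid = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1).reshape(-1, 2)
    target = smoothed_uniform(grid, bandwidth)
    total, count = 0.0, 0
    for dr, dc in offsets:                      # non-negative offsets assumed
        for r in range(rows - dr):
            for c in range(cols - dc):
                # Channels act as samples of the pair (position, neighbor).
                pair = np.stack([z_local[r, c], z_local[r + dr, c + dc]], axis=1)
                total += np.mean((gaussian_kde(pair, grid, bandwidth) - target) ** 2)
                count += 1
    return total / count
```

Perfectly correlated neighbors put all samples on the diagonal of the square, so their density estimate differs strongly from the smoothed uniform target and the loss is larger than for independent samples.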
4.1 Experimental setup
For the texture optimization procedure we used gradient-based optimization (BFGS from SciPy), and we constrained all values of the noise tensor to lie within the range of the prior distribution. A single gradient step takes 0.2 seconds for a 1024x768 pixel image and 0.4 seconds for 1600x1000 pixels on a Tesla K80 GPU. Usually fewer than 20 iterations are enough to get nice looking mosaic images. These timings apply for a simple perceptual distance map and our standard PSGAN architecture (see below) – using a neural network for the correspondence map can cost more time, depending on the layers and channels. In general, we expect that a very large content image and a very rich PSGAN model (trained on many textures) can make the optimization more complicated and require more iterations, but we did not inspect this in detail. With small texture sets of a few training images we got expressive PSGAN models that could be used for good-looking mosaics.
While our optimizer worked quite well when starting from a single texture (same on every spatial position), we found that doing a few iterations of stochastic search (via sampling random projections, see Section 4.2.1) and using the sampled mosaic with lowest loss as initialization can help the optimizer reach solutions slightly better than a trivial initialization from a random texture. In addition, by randomly sampling initializations for the optimizer, we are able to explore multiple different solutions to the mosaic loss optimization, which is another source of aesthetic variety.
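The combination of stochastic initialization and constrained gradient optimization can be illustrated on a toy problem. Everything below is a stand-in: the generator is replaced by nearest-neighbour upsampling, the optimizer is plain projected gradient descent rather than BFGS, and the range $[-1, 1]$ is an assumed prior support.

```python
import numpy as np


def toy_generator(z, scale=4):
    """Hypothetical stand-in for the PSGAN generator: nearest-neighbour upsampling."""
    return np.kron(z, np.ones((scale, scale)))


def content_loss(z, content, scale=4):
    """Mean squared difference between the toy output and the content image."""
    return np.mean((toy_generator(z, scale) - content) ** 2)


def content_grad(z, content, scale=4):
    """Analytic gradient of content_loss w.r.t. z for the toy generator."""
    residual = toy_generator(z, scale) - content
    h, w = z.shape
    block_sums = residual.reshape(h, scale, w, scale).sum(axis=(1, 3))
    return 2.0 * block_sums / content.size


def optimize_noise(content, scale=4, n_init=8, n_steps=20, seed=0):
    """Pick the best of several random starts, then run projected gradient descent."""
    rng = np.random.default_rng(seed)
    shape = (content.shape[0] // scale, content.shape[1] // scale)
    candidates = [rng.uniform(-1.0, 1.0, size=shape) for _ in range(n_init)]
    z = min(candidates, key=lambda cand: content_loss(cand, content, scale))
    step = 0.25 * content.size / scale ** 2     # halves the error at every step
    for _ in range(n_steps):
        z = np.clip(z - step * content_grad(z, content, scale), -1.0, 1.0)
    return z
```

Changing `seed` gives different starting points; with a real texture generator this is one way to explore several different mosaics for the same content image.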
4.2 Effects of the content loss correspondence map
The correspondence map used for the content loss directly encodes a choice of perceptual distance metric, which gives us flexibility in the transfer of the texture appearance onto the mosaic. If we use the identity as the map, then for some images the mosaic output will be degenerate. Figure 5(a) shows this drawback of the identity correspondence map when the content image has frequencies that are too high and are difficult to map to the texture manifold. Adding downscaling (e.g. with an average pooling filter) to the map will emphasize the lower frequencies in the content image and lead to better texture appearance, see Figure 5(b,c). Downscaling too much will make the mosaic reproduce the content image less accurately, as in Figure 5(d), but as a trade-off the stone textures are clearly recognizable.
Another interesting choice for the correspondence map is to use the luminance channel of the RGB image, defined by us as the average of the 3 RGB channels, instead of the exact colors. This is useful when the texture manifold is very different in color hue from the content image. Figure 6(a) shows an example mosaic with the luminance map; it has more color variation than the mosaic with the RGB map in Figure 6(d).
The downscaling and color transformations are examples of manually specified correspondence maps. As an alternative, we can take filters from the pretrained VGG network Simonyan14c, which is similar to the approach of GatysEB15a. Figure 6(b,e,f) shows our results with VGG, which have a different aesthetic than the other choices of correspondence map: this perceptual distance is more flexible w.r.t. color hue and also w.r.t. spatial matching of the content. This is due to the VGG network architecture, which is deep (many convolutional layers) and wide (many filter channels) and uses pooling operations.
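The manually specified maps can be sketched directly in NumPy. This is an illustrative sketch: the function names, the channels-last layout and the way the two options are combined are assumptions, and the VGG-based map is not covered.

```python
import numpy as np


def average_pool(image, factor):
    """Downscale an (H, W, C) image by averaging non-overlapping blocks."""
    h, w, c = image.shape
    h2, w2 = h // factor, w // factor
    cropped = image[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))


def luminance(image):
    """Luminance as the plain average of the 3 RGB channels."""
    return image.mean(axis=-1, keepdims=True)


def content_loss(generated, content, factor=4, use_luminance=False):
    """Mean squared difference after applying the same map to both images."""
    def phi(img):
        if use_luminance:
            img = luminance(img)
        return average_pool(img, factor)

    return np.mean((phi(generated) - phi(content)) ** 2)
```

With `factor=1` and `use_luminance=False` the map is the identity. A fine checkerboard and a flat gray image of the same mean then differ, while with pooling they do not, which mirrors the frequency split described above.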
4.2.1 Disabling the mosaic content loss
A very specific choice of correspondence map is one that always outputs a constant value, which is equivalent to disabling the whole content loss term in the GANosaic loss. For that case, we introduce a simpler alternative method that can create mosaic images using a PSGAN texture generator. We can directly paint the global noise dimensions with values related to the pixels of the content image. E.g. we can use a random linear projection from pixels (3 channels) to the channels of the global noise tensor, followed by a nonlinearity to keep the values within the range of the prior. Concretely, given a content image, we can calculate a downscaled version at the spatial resolution of the latent noise space of $G$. We can then sample a random matrix that maps the 3 color channels to the global channels and apply it, followed by the nonlinearity, at every spatial position. Afterwards we can generate an image from the projected global channels and local noise sampled from the prior.
While very simple, this approach is useful for exploration of texture manifolds and is very fast to calculate. Figure 7 shows an example: the random projection paints the low frequency segments of the content image, while exhibiting a lot of details from the texture manifold. This is useful for creating smoothly morphing videos that illustrate random walks in the space of the texture manifolds while projecting a specific content image, usually preserving the low frequencies of the content well. Please see an example video of Che Guevara rendered with Sydney textures at youtu.be/4GAFQwE3kLs. As a drawback, the random projection mosaics completely ignore the high frequencies of the content image. The random projection output is also more unpredictable: some projections look better than others, but the fast generation speed (0.1s for the 2048x1536 pixel image in Figure 7) enables exploring many such images as an artistic selection process.
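The random projection can be sketched as follows. The Gaussian projection matrix, the `tanh` nonlinearity and the block-average downscaling are assumptions chosen for illustration; the text only specifies a random linear projection followed by a range-limiting nonlinearity.

```python
import numpy as np


def random_projection_global(content, latent_hw, d_global, rng):
    """Paint the global noise channels from a downscaled content image.

    `content` is an (H, W, 3) image, `latent_hw` the spatial size of the
    latent noise space. Returns an array of shape latent_hw + (d_global,).
    """
    h, w, _ = content.shape
    lh, lw = latent_hw
    fh, fw = h // lh, w // lw
    # Downscale to the latent resolution by block averaging.
    small = content[:lh * fh, :lw * fw].reshape(lh, fh, lw, fw, 3).mean(axis=(1, 3))
    # Random linear map from 3 color channels to the global channels.
    projection = rng.normal(size=(3, d_global))
    # Nonlinearity keeps the values inside (-1, 1).
    return np.tanh(small @ projection)
```

Each new random matrix gives a different mosaic of the same content, and slowly changing the matrix over time is one way to obtain a morphing sequence.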
4.3 Effects of the texture loss
Optimizing the local values $Z^l$ together with the global values $Z^g$ can lead to lower content loss than tuning $Z^g$ alone and fixing $Z^l$ to a random sample from the prior. However, special care needs to be taken to keep the distribution of $Z^l$ close to the prior distribution and to avoid degeneracy away from the texture manifold, hence the texture loss term we defined in Section 3. Figure 8 shows graphically the mosaic quality obtained by optimizing $Z^l$ and regularizing it with the texture loss. This term acts as a regularization term that helps prevent degeneration. It also allows GANosaic to reach a better content loss by optimizing both $Z^l$ and $Z^g$, rather than just $Z^g$.
Figure 9 shows plots of the convergence and emphasizes the optimization behaviour for the 3 different settings from the previous figure. Plot 9(a) shows the values of the content loss when optimizing with the different settings, and we see how optimizing $Z^l$ leads to lower content loss. The texture loss is effectively lowered when the texture loss term is included in the objective, see Figure 9(c). To give intuition for how the tensors look, we display 4 channels from the $Z^l$ tensor as images. In Figure 9(b) they are correlated with the content image and deviate from the prior. Such values of $Z^l$ can lead to the degeneracy displayed in Figure 8(b). In Figure 9(d) the texture loss term de-correlates $Z^l$ from the content image and makes the values locally consistent with the uniform prior, which corresponds to the mosaic from Figure 8(c).
This section contains a short discussion of the properties of GANosaic mosaics. Traditionally, photomosaic algorithms utilized large image datasets photomosaic ; JIM. Texture rendering is usually done with a single texture EfrosQ ; GatysEB15a. In contrast, GANosaic can use rich texture manifolds as style representations and allows the generation, via convolutional neural networks, of large mosaics that smoothly render any content image. The generation of texture mosaics is achieved by optimization in the latent noise space of a PSGAN texture model. One application of GANosaic is consumer-facing apps that allow quick rendering of user content (images and video) using pre-trained texture models. Another, more professional use case is graphic design, where the artist can use PSGAN and GANosaic as tools in the creative process:
- carefully select a set of texture examples
- train a PSGAN texture model on them
- use the texture model on any content image to create high resolution art, suitable for posters.
As a practical tip, we recommend regarding the GANosaic process not as fully automatic but as semi-automatic: the artist can explore the results of different initializations and iterations of the optimizer, and select mosaics that show textures with the “right” aesthetic look.
The fast runtime and the ability to handle arbitrarily large output image resolutions come from the spatial GAN architecture SGAN2016 used in the texture prior of GANosaic. The generation of $G(Z)$ can be easily split by calculating $G$ on separate subtensors of $Z$ and combining the results. This property also applies to our optimization framework, since all the loss function terms can be calculated and aggregated in spatial chunks. Thus, mosaics of very high resolution, practically unlimited by memory (only by storage), can be made. The computation time scales linearly with the number of output pixels.
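The chunked aggregation of a loss term can be sketched as below for a mean squared error. The chunk size and function name are illustrative. The sketch only covers the loss aggregation; splitting a convolutional generator itself may additionally need overlapping borders between subtensors, which is not shown here.

```python
import numpy as np


def chunked_mse(generated, content, chunk=256):
    """Mean squared error accumulated over spatial chunks.

    Only one chunk of the difference is held in memory at a time; the
    result equals the mean over the full images.
    """
    total, count = 0.0, 0
    for r in range(0, generated.shape[0], chunk):
        for c in range(0, generated.shape[1], chunk):
            diff = generated[r:r + chunk, c:c + chunk] - content[r:r + chunk, c:c + chunk]
            total += float(np.sum(diff ** 2))
            count += diff.size
    return total / count
```

The work done is proportional to the number of pixels, in line with the linear scaling noted above.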
In order to minimize the mosaic loss in Equation 1, different spatial regions of the input noise tensor will converge to different values, and thus the image output from the generator will be a mosaic of different texture processes. While this works well for the low frequencies in the content image, there is a limit to how high the frequencies in the output mosaic can be. The receptive field of the generator model determines the highest possible frequencies we can obtain in the mosaics by setting the generator input. Optimizing both the local and global dimensions improves the texture mosaic resolution and allows painting finer details that fit the content better, but there is always a limit to how high the mosaic image frequencies can go and what level of pixel detail is achievable.
Downscaling in the correspondence map can improve the image quality of the output mosaic (see Figure 5): it acts like an averaging filter that removes the high frequencies from the content image. Thus, the content loss depends on the lower frequencies, and the high frequencies will be determined by the texture manifold. The choice of content image and its pixel size matter as well. Large content images have lower frequencies and are easier for mosaics. In practice, we can always upscale the input image to obtain larger mosaics with lower frequencies. Using a smaller texture brush relative to the image leads to a lower content loss, analogous to a large photomosaic with small tiles. On the other hand, smaller content images imply larger textures relative to them (e.g. the works of Arcimboldo), and this is a more challenging case for mosaic optimization.
The approach of GANosaic, which locally fits the “right” texture, is different from the neural art style approach GatysEB15a. Figure 10 shows an example of neural style transfer when using a style image with many textures. The style descriptor consists of the feature correlations marginalized over the spatial extent of the image, and style transfer will try to transfer the full distribution of the style image to the content, which explains the mixed stone background. The pixel space optimization can preserve the high frequency details of the content image, but in some regions (e.g. the face in Figure 10(b)) it looks like it merely paints the content directly, rather than trying to represent it with textures.
In contrast, GANosaic works by modifying only the input noise tensor $Z$ to the generator network $G$. This network acts as a regularizer that outputs images that are close to the underlying PSGAN texture manifold. GANosaic does not directly optimize the pixels of the output mosaic image, but only modifies $Z$ as input of $G$. This is a more constrained optimization problem, since the noise space has a much lower spatial size than the image $G(Z)$. E.g. with a generator of 5 layers, an image with 1024x1024 pixels will be generated by a tensor of size 32x32 spatial positions.
In a related work, DumoulinSK16 explored the blending of multiple styles by parametrically mixing their statistical descriptors, but the result is a style that globally averages the properties of the input styles, rather than painting locally with different styles. In contrast, GANosaic (e.g. as shown in Figure 5 with the same texture style) can focus on different modes of the texture distribution that approximate the content image well locally, and the regularization ensures that each local region is on the texture manifold.
The authors would like to thank Roland Vollgraf for useful discussions and feedback.
- (1) Urs Bergmann, Nikolay Jetchev, and Roland Vollgraf. Learning texture manifolds with the periodic spatial GAN. In Proceedings of The 34th International Conference on Machine Learning, 2017.
- (2) M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
- (3) Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. CoRR, abs/1610.07629, 2016.
- (4) Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, 2001.
- (5) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.
- (6) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, 2014.
- (7) A. Hyvärinen, J. Hurri, and P. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer, 2009.
- (8) Nikolay Jetchev, Urs Bergmann, and Roland Vollgraf. Texture synthesis with spatial generative adversarial networks. CoRR, abs/1611.08207, 2016.
- (9) J. Kim and F. Pellacini. Jigsaw image mosaics. In Proc. of the 29th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, 2002.
- (10) Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012.
- (11) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
- (12) Robert Silvers. Photomosaics. Henry Holt and Co., Inc., 1997.
- (13) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.