Code for TileGAN: Synthesis of Large-Scale Non-Homogeneous Textures (SIGGRAPH 2019)
We tackle the problem of texture synthesis in the setting where many input images are given and a large-scale output is required. We build on recent generative adversarial networks and propose two extensions in this paper. First, we propose an algorithm to combine outputs of GANs trained on a smaller resolution to produce a large-scale plausible texture map with virtually no boundary artifacts. Second, we propose a user interface to enable artistic control. Our quantitative and qualitative results showcase the generation of synthesized high-resolution maps consisting of up to hundreds of megapixels as a case in point.
Example-based texture synthesis is the task of generating textures that look similar to a given input example. The visual features of the input texture should be faithfully reproduced while maintaining both small-scale as well as global characteristics of the exemplar.
In this paper, we are interested in synthesizing large-scale textures that consist of multiple megapixels (see Fig. 1). The first challenge in large-scale texture synthesis is to process a large amount of input data. This is crucial because without a considerable amount of reference data, any generated output will not have a lot of variability and lack features at multiple scales. Such a synthesized output could be large-scale, but will be very homogeneous and boring or repetitive. Recent work in parametric texture synthesis using generative adversarial networks (GANs) seems ideally suited to tackle this challenge and we build on a recent GAN architecture that can generate high-quality results when trained on natural textures (Karras et al., 2018a). The second challenge in large-scale texture synthesis is how to generate large-scale output data. This is the core topic of this paper and we have identified two important sub-problems that we tackle in our work.
First, assuming that the selected GAN can only generate tiles of limited resolution, it is necessary to make these tiles match. There are multiple possible solutions to this problem that were explored in previous work. An elegant and powerful method is to compute graph cuts between overlapping tiles (Kwatra et al., 2003). While this method works well in some cases, very often it leads to artifacts when the blended tiles are not similar enough. Another possibility is to use a pixel-based texture synthesis algorithm like PatchMatch (Barnes et al., 2009) to repair seams between textures. A very simple method is to use blending. We propose a solution that is based on manipulating latent codes of lower resolution levels of the GAN to obtain nice transition regions. See Fig. 2 for a comparison of our method illustrated by blending four neighboring tiles.
Second, we need to be able to incorporate user input to control the visual appearance of the synthesized output. A major challenge for most existing texture synthesis methods is the artistic control over the final result. While patch-based texture synthesis techniques can be constructed to provide artistic control, such as painting by numbers (Hertzmann et al., 2001; Ritter et al., 2006; Lukáč et al., 2015; Lockerman et al., 2016), most existing GAN-based texture synthesis approaches provide no (or minimal) artistic control. In this paper, we propose a solution based on latent brushes and other intuitive editing tools that allow easy global control over large-scale texture maps. Please see the accompanying video for examples.
Technically, the major contribution of our paper is to provide a framework to take a GAN of limited resolution as a building block and produce a possibly infinite output texture. As a practical result, we are able to significantly improve the quality and speed of the state of the art in large-scale texture synthesis.
We review the literature most related to our work sorted into multiple categories and refer the reader to Akl et al. (2018) for a more comprehensive and recent literature review on example-based texture synthesis.
Existing non-parametric texture synthesis algorithms try to synthesize a new texture such that each patch in the output texture has an approximate match in the input texture (Kwatra et al., 2005). A very important ingredient for these algorithms is a fast correspondence algorithm such as PatchMatch (Barnes et al., 2009) that is employed in most state-of-the-art texture synthesis algorithms, e.g. (Kaspar et al., 2015; Huang et al., 2015). PatchMatch can be extended to create faster queries (Barnes et al., 2015) or additional error metrics (Darabi et al., 2012; Kaspar et al., 2015; Zhou et al., 2017). While existing methods provide strong visual results, we propose to build on recent work in deep learning that shows a much greater promise with regards to the scalability of the considered input data or the size of the output due to faster synthesis speed. A notable earlier algorithm proposed a hierarchical extension using an earlier version of non-parametric synthesis (Han et al., 2008), but this algorithm is specific to the texture synthesis algorithm it employs (Lefebvre and Hoppe, 2005) and it cannot be easily adapted to a deep learning framework.
A popular early approach to texture synthesis was to extract features and feature statistics from an input texture and then try to create a new texture that would match these feature statistics (Heeger and Bergen, 1995; De Bonet, 1997; Portilla and Simoncelli, 2000). Gatys et al. (2015b; 2015a) proposed the idea to use inner products between feature layers at different levels of the network as a texture descriptor. For each layer of the network, each pair of features gives one inner product to compute. This idea was expanded by Sendik and Cohen-Or (2017), who introduced a structural energy term based on correlations between deep features, thus capturing self-similarities and regularities in the structural composition of the texture. The technique proposed by Snelgrove et al. (2017) presents an early effort to increase the maximum size of texture features that can be synthesized using the method of Gatys et al. (2015b). This is accomplished by matching a small number of network layers across many scales of a Gaussian pyramid, leading to improved synthesized textures. Instead of using gradient-based optimization to compute new textures, it is also possible to train a generator using the feature statistics for the loss function (Dosovitskiy and Brox, 2016).
Generative adversarial networks (GANs) were introduced in a seminal paper by Goodfellow et al. (2014). Over the years, the architecture of GANs has improved significantly and state-of-the-art GANs can produce results of stunning visual quality (Karras et al., 2018a; Brock et al., 2018; Karras et al., 2018b; Zhang et al., 2018). In our work, we have chosen to build on the framework proposed by Karras et al. (2018a). They introduced a progressively growing architecture that starts the training on a low-resolution exemplar and slowly increases the size of the networks, as well as the exemplars. Their network is able to produce semantically coherent image content at a significantly higher resolution than previous work. Zhou et al. (2018) introduced a technique to expand textures while preserving challenging structural arrangements by iteratively training a GAN on sub-blocks of the input textures. While this work uses GANs, it only uses a single image as input by generating many different crops during training. The convolutional nature of GANs can be exploited to synthesize images and textures of output sizes different from the image resolution the GAN was trained on (Jetchev et al., 2016; Bergmann et al., 2017). GANosaic (Jetchev et al., 2017) extends such methods to generate textures by optimizing the latent noise space to produce textures that match the overall content of a given guidance image. However, such methods are limited in the type of textures they support, the expected size, and the overall variability and quality of the output. The image stylization method FAMOS (Jetchev et al., 2018) improves on the quality by training the texture GAN and the guidance image styling network at the same time. While this method produces smoother transitions between texture patches, it still suffers from issues relating to scalability and variability, which we try to address in our method.
and super-resolution (Wang et al., 2018c). While traditional GANs generate images starting from a random vector, the GAN training can be extended to the problem of image-to-image translation using either paired or unpaired training data (Isola et al., 2017; Zhu et al., 2017a; Zhu et al., 2017b; Huang et al., 2018). In computer graphics, recent papers apply GANs to the synthesis of caricatures of human faces (Cao et al., 2018), the synthesis of human avatars from a single image (Nagano et al., 2018), texture and geometry synthesis of building details (Kelly et al., 2018), surface-based modeling of shapes (Ben-Hamu et al., 2018) and the volumetric modeling of shapes (Wang et al., 2018a). The most related problem to our work is the problem of terrain synthesis (Guérin et al., 2017).
In this section, we present the three main components of our framework in a high-level overview:
Our framework requires a generative model that produces novel images. State-of-the-art GANs typically consist of two networks: a generator and a discriminator. The generator network produces sample images which match the training distribution using convolutional layers that gradually increase the spatial resolution of a random latent vector to a full-size image. The discriminator network assesses how well the generated samples match the training distribution. The two networks are constructed to be differentiable and their gradients are used to guide the training of the full generative model. Our main focus in this paper is on combining multiple outputs of the generator network of a standard GAN for large scale texture synthesis. More details on the GANs and data sets we use in our experiments are given in Sec. 6.
Our key contribution is a method to synthesize plausible large-scale non-homogeneous textures using a pretrained generator network. This is accomplished by generating a tiling of compatible intermediate latent vectors, which we call the latent field, that the generator network uses to produce a coherent large-scale texture (see Fig. 3). The intermediate latent vectors can be efficiently sampled and stored for analysis and online processing. In order to ensure that the synthesized textures are globally coherent, we optimize the latent field to satisfy two main objectives. First, the expected synthesized output should follow an initial small target guidance map for the expected large-scale synthesized image. This map can be randomly generated or specified by the user. Second, in order to afford local coherence and minimize abrupt texture changes between neighboring texture tiles, we optimize the latent field by replacing problematic tiles with better candidates that are more compatible with their neighbors. The details of our entire synthesis pipeline will be presented in Sec. 4.
We propose a set of tools to facilitate user control over our texture synthesis process. The key idea behind the control our method affords during synthesis lies in modifying the latent field. To that end, we utilize operations such as painting, shuffling, copying, and target image matching, all of which enable different ways of artistic control. We will discuss details about our interactive tool in Sec. 5.
We first describe the general notation used in this paper before describing the different phases of our framework. We start by redefining the generator network $G$, from a standard deep convolutional GAN, as:
$$G(z) = G_B^m\big(G_A^m(z)\big),$$
where $z$ is a randomly sampled latent vector and $m$ specifies the intermediate level at which we plan to perform our latent field synthesis. Lastly, $G_A^m$ and $G_B^m$ are two parts of $G$ that split the set of convolutional layers at the level $m$ (see Fig. 3). For a GAN with $L$ levels, $G_A^m$ takes the latent vector at level 1 as input and produces a tile at level $m$, and $G_B^m$ takes a tile at level $m$ as input and produces a color image at level $L$, the final level.
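The split of the generator into two parts can be illustrated with a minimal sketch. It assumes a toy generator whose levels are plain nearest-neighbor 2x upsampling steps (a real ProGAN level also contains convolutions); the names `upsample2x`, `run_levels`, and `split_generator` are hypothetical and not part of the TileGAN code.

```python
import numpy as np

def upsample2x(x):
    """Toy stand-in for one generator level: nearest-neighbor 2x upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def run_levels(levels, x):
    """Apply a list of level functions in order."""
    for level in levels:
        x = level(x)
    return x

def split_generator(levels, m):
    """Split the list of levels at level m into a front part and a back part."""
    front, back = levels[:m], levels[m:]
    return (lambda z: run_levels(front, z)), (lambda t: run_levels(back, t))

# Hypothetical 5-level toy generator, split at level 2.
levels = [upsample2x] * 5
g_front, g_back = split_generator(levels, m=2)

z = np.random.default_rng(0).normal(size=(1, 1, 8))  # latent vector as a 1x1 map
tile = g_front(z)       # intermediate latent tile at the split level
image = g_back(tile)    # output at the final level
full = run_levels(levels, z)
```

Composing the two parts reproduces the unsplit generator, which is what allows the intermediate tiles to be stored, edited, and recombined before the back part is applied.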
Our large-scale non-homogeneous texture synthesis framework is divided into three phases: (1) a one-time preprocessing phase, (2) an online latent field synthesis phase, and (3) an online texture synthesis phase.
The first step of preprocessing is to create a large set of texture samples that are generated using the generator network from a standard deep convolutional GAN. Each sample comprises two components: (1) an intermediate tensor, which we refer to as a latent tile, taken at the level at which we will synthesize the latent field, and (2) a downsampled version of the texture map at a fixed spatial resolution. The greater the number of samples in this set, the more texture variability is afforded by our framework. The second step in this phase is to cluster the texture samples by their visual appearance, using the downsampled texture maps, in order to enable fast lookup of visually similar latent tiles. We perform the clustering using k-means and assign cluster centers as representative texture samples.
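The preprocessing phase can be sketched as follows. This is a minimal illustration under stated assumptions: random arrays stand in for the sampled latent tiles and downsampled images, the sample count, tile shapes, and number of clusters are arbitrary placeholders, and `kmeans` is a plain Lloyd iteration rather than the authors' clustering code.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):          # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d.argmin(axis=1)

rng = np.random.default_rng(1)
n_samples, k = 200, 5
latent_tiles = rng.normal(size=(n_samples, 4, 4, 16))  # stand-in intermediate tensors
thumbnails = rng.random(size=(n_samples, 8, 8, 3))     # stand-in downsampled images

# Cluster by visual appearance, i.e. on the flattened downsampled images.
features = thumbnails.reshape(n_samples, -1)
centers, labels = kmeans(features, k)

# Lookup table from cluster id to the indices of its latent tiles.
members_of = {c: np.flatnonzero(labels == c) for c in range(k)}
```

The lookup table is what makes later queries cheap: a request for a tile that resembles a given cluster only needs to consider that cluster's members.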
The second phase of our framework is the synthesis of large compositions of GAN-generated textures with no apparent visual artifacts, seams, or obvious repetition. We use a variant of the Markov Random Fields (MRF) model for texture synthesis applied on the latent field. While the MRF model has been applied to texture colors (Wei et al., 2009) as well as texture statistics (Li and Wand, 2016), we are the first to propose an application to GAN latent vectors. The goal of such an MRF model can be redefined for our framework as follows: given a large set of individual texture tiles sampled from a single distribution, synthesize a large-scale output of texture tiles so that for each output tile, its spatial neighborhood is similar to some neighborhood from the input distribution. With this MRF assumption, the similarity of the local neighborhood between input and output helps ensure an overall coherent texture map with minimal boundary artifacts.
In order to efficiently generate textures at large scale, we perform the latent field synthesis in two steps: an initialization step and an iterative refinement step. Alg. 1 formalizes the entire process of our framework for latent field synthesis. Splitting the computational task of synthesizing the texture facilitates interactive editing. The first step is typically computed on the order of seconds and is immediately presented to the user. The refinement step is computed on a background process that regularly updates the latent field and displays the final output. Alg. 2 represents how we find better candidates in the refinement step.
In the initialization step, we aim to efficiently generate a tiling of the latent field that approximately satisfies the guidance map. The map provides global content control. At this stage, we assume that the texture samples, generated at the chosen latent level, and their clustering result were computed in a prior preprocessing step. We start by performing a latent-tile-based texture synthesis to cover all unassigned tiles in the output latent field. For each unassigned tile, we find the single top-matching sample using the unary energy term defined below. We repeat this process until no unassigned latent tiles remain.
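A minimal sketch of the initialization step, assuming the guidance map is divided into non-overlapping cells of the same size as the downsampled sample images and that matching uses the plain Euclidean distance. The function name `init_latent_field` and the array layout are hypothetical; the sketch also ignores the clustering, which in practice would narrow the search.

```python
import numpy as np

def init_latent_field(guidance, thumbs, tile_px):
    """For each guidance-map cell, pick the sample whose downsampled image
    is closest to that cell in Euclidean distance."""
    rows, cols = guidance.shape[0] // tile_px, guidance.shape[1] // tile_px
    flat = thumbs.reshape(len(thumbs), -1)
    field = np.full((rows, cols), -1, dtype=int)   # -1 marks an unassigned tile
    for r in range(rows):
        for c in range(cols):
            region = guidance[r * tile_px:(r + 1) * tile_px,
                              c * tile_px:(c + 1) * tile_px]
            dist = np.linalg.norm(flat - region.reshape(1, -1), axis=1)
            field[r, c] = int(dist.argmin())
    return field

# Three constant-color stand-in samples: black, gray, white.
thumbs = np.stack([np.full((4, 4, 3), v) for v in (0.0, 0.5, 1.0)])

# Guidance map: left half black, right half white.
guidance = np.zeros((8, 8, 3))
guidance[:, 4:] = 1.0

field = init_latent_field(guidance, thumbs, tile_px=4)
```

The result is a grid of sample indices; the latent field itself is obtained by placing each sample's latent tile at the corresponding grid position.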
Refinement steps are performed until the total energy of the latent field is lower than a set threshold. This stopping criterion is currently set empirically to a value that ensures that a desirable variety of visual features in the tiling is preserved during the MRF refinement. In each step, we randomly sample a latent tile in the latent field and check for candidate tiles that minimize the local energy. We define the optimal latent field as the field that minimizes an energy composed of weighted unary and binary terms.
The unary term is the sum of visual similarities of a candidate tile with its corresponding region in the guidance map. In our experiments, we consider the Euclidean distance of the two images as the similarity measure in order to accelerate this computation. The binary term considers the 4-connected neighboring latent tiles of each tile and combines three weighted dissimilarity terms.
These terms represent the dissimilarity between a tile and another tile in the set of its 4-connected neighbors in terms of visual appearance, latent vector representation, and cluster membership. For every pair of latent tiles in the 4-connected neighborhood, we approximate the first two using the Euclidean distance of their overlapping region. The visual dissimilarity is computed on the corresponding downsampled image of each tile, while the latent dissimilarity is computed on the corresponding latent tensor. The last term is the average agreement of cluster membership, where we assign a 0 to pairs with matching clusters and a 1 to non-matching pairs. The different energy terms are weighted by the corresponding weight parameter; in our experiments, one weight is set to 1 and two are set to 0.5. When finding the top matches in the refinement step, we first return the top-matching tiles and then compute the entire energy after placing each candidate tile.
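The local energy and a single refinement step can be sketched as below. This is an illustration under several assumptions: distances are taken over whole tiles rather than over overlapping regions, the unary term has an implicit weight of 1, and the assignment of the values 1 and 0.5 to the three pairwise weights is a placeholder. The names `local_energy` and `refine_tile` are hypothetical.

```python
import numpy as np

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-connected neighborhood

def local_energy(field, r, c, thumbs, latents, labels, guidance_cells,
                 w_visual=1.0, w_latent=0.5, w_cluster=0.5):
    """Unary term plus averaged pairwise terms for the tile at (r, c).
    The weight values are placeholders, not the authors' assignment."""
    i = field[r, c]
    unary = np.linalg.norm(thumbs[i] - guidance_cells[r, c])
    pair_terms = []
    for dr, dc in NEIGHBORS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < field.shape[0] and 0 <= cc < field.shape[1]:
            j = field[rr, cc]
            d_visual = np.linalg.norm(thumbs[i] - thumbs[j])
            d_latent = np.linalg.norm(latents[i] - latents[j])
            d_cluster = 0.0 if labels[i] == labels[j] else 1.0
            pair_terms.append(w_visual * d_visual + w_latent * d_latent
                              + w_cluster * d_cluster)
    binary = float(np.mean(pair_terms)) if pair_terms else 0.0
    return float(unary + binary)

def refine_tile(field, r, c, candidates, **data):
    """Replace the tile at (r, c) by the candidate with the lowest local energy."""
    best, best_e = field[r, c], local_energy(field, r, c, **data)
    for cand in candidates:
        trial = field.copy()
        trial[r, c] = cand
        e = local_energy(trial, r, c, **data)
        if e < best_e:
            best, best_e = cand, e
    field[r, c] = best
    return best_e

# Two stand-in samples and an all-black guidance map split into 2x2 cells.
data = dict(
    thumbs=np.stack([np.zeros((2, 2)), np.ones((2, 2))]),
    latents=np.stack([np.zeros(3), np.ones(3)]),
    labels=np.array([0, 1]),
    guidance_cells=np.zeros((2, 2, 2, 2)),
)
field = np.zeros((2, 2), dtype=int)
field[0, 0] = 1                      # one badly matching tile
energy_before = local_energy(field, 0, 0, **data)
energy_after = refine_tile(field, 0, 0, candidates=[0, 1], **data)
```

In this toy setting the mismatched tile is replaced by the sample that agrees with both the guidance map and its neighbors, which drives the local energy to zero.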
An important aspect when combining tiles from different samples is the unpredictability at their joining region (see Fig. 7). In our experiments, we have noticed that latents that fall on the outermost regions of the latent tile exhibit considerable instability. This is likely due to a bias caused by the zero padding that is applied in the generator. In order to minimize this effect, we only consider a cropped version of the latent tile for each sample. The size of the cropped latent tile influences the overall visual coherence of neighboring output regions, where smaller latent tiles exhibit a smoother feature blending than larger tiles. In our experiments, we have typically used a fixed range of latent tile sizes regardless of the merge level. While we crop the latent blocks, we use the entire representative image for comparison with the guidance map, which creates an overlapping sliding-window effect when finding tile matches, thereby further enhancing the coherence of neighboring tiles.
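Cropping and assembling latent tiles can be sketched as follows, assuming a symmetric center crop and square tiles. The helper names `center_crop` and `assemble` and all sizes are hypothetical choices for illustration.

```python
import numpy as np

def center_crop(tile, size):
    """Keep the central size x size block and drop the unstable border latents."""
    h, w = tile.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return tile[top:top + size, left:left + size]

def assemble(field, latents, size):
    """Build a latent field by placing the cropped tile of each grid entry."""
    rows = [np.concatenate([center_crop(latents[i], size) for i in row], axis=1)
            for row in field]
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
latents = rng.normal(size=(10, 6, 6, 4))      # ten stand-in 6x6 latent tiles
field = np.array([[0, 3, 5],
                  [7, 1, 9]])                 # grid of sample indices

probe = center_crop(np.arange(36).reshape(6, 6, 1), 2)
latent_field = assemble(field, latents, size=2)
```

A smaller crop size places seams closer together, so that each output region is influenced by more neighboring samples.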
Selecting the level at which to split the GAN is a trade-off between the quality of the transition region and the scope of the region impacted by the transition changes (see Fig. 4). We have mainly experimented with splitting at earlier levels in our work because these settings yield the best visual results according to our judgment.
|Result||Training data set||Output||Merge level||Latent tile size||Synthesis||User editing||Total|
|Medieval Island (Fig. 9 top)||4700||94||2||15.0||1||16.0|
|Dinosaur Park (Fig. 9 bottom)||17000||22||2||4.5||5||9.5|
|Mountain Painting (Fig. 10 top)||7800||620||2||25.0||0||25.0|
|Space Panorama (Fig. 10 bottom)||5000||6.5||5||1.2||10||11.2|
|STTO (Fig. 6 top)||0.27||1||—||—||9.4||—||9.4|
|NSTS (Fig. 6 top)||0.27||1.08||—||—||2000||—||2000|
|TileGAN (Fig. 6 top)||17000||9.8||2||3.0||0||3|
|STTO (Fig. 6 bottom)||0.27||1||—||—||10.25||—||10.25|
|NSTS (Fig. 6 bottom)||0.27||1.08||—||—||2000||—||2000|
|TileGAN (Fig. 6 bottom)||17000||9.8||2||3.0||0||3|
The final synthesized image is generated by taking a latent field and applying the trained generator network to it. Using this multi-stage process, the network is able to output arbitrarily large results. The size of the resulting RGB texture is determined by the size of the latent field and the number of levels in the pyramid. Since the generating function is convolutional, we can profit in two ways. First, it can be efficiently applied for local re-synthesis. Second, arbitrarily large latent fields can be processed by multiple overlapping applications of the generator, where the overlapping parts of the output of each application are discarded.
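The overlapping block-wise application can be sketched with a toy convolutional generator. The sketch assumes a single-channel latent field, a stand-in generator made of 2x upsampling followed by a zero-padded 3x3 mean filter, and one latent cell of context per side; `toy_generator` and `synthesize_in_chunks` are hypothetical names, and a real generator would need a larger context.

```python
import numpy as np

def toy_generator(x):
    """Stand-in convolutional generator: 2x upsampling, then a zero-padded
    3x3 mean filter."""
    up = x.repeat(2, axis=0).repeat(2, axis=1)
    padded = np.pad(up, 1)                    # zero padding, as in a padded conv
    h, w = up.shape
    out = np.zeros_like(up)
    for dr in range(3):
        for dc in range(3):
            out += padded[dr:dr + h, dc:dc + w]
    return out / 9.0

def synthesize_in_chunks(field, generator, chunk=4, pad=1, scale=2):
    """Run the generator on overlapping windows and keep only the part of
    each output that is unaffected by the window border."""
    H, W = field.shape
    out = np.zeros((H * scale, W * scale))
    for r0 in range(0, H, chunk):
        for c0 in range(0, W, chunk):
            r1, c1 = min(r0 + chunk, H), min(c0 + chunk, W)
            # Window with extra context, clipped to the field.
            wr0, wc0 = max(r0 - pad, 0), max(c0 - pad, 0)
            wr1, wc1 = min(r1 + pad, H), min(c1 + pad, W)
            block = generator(field[wr0:wr1, wc0:wc1])
            # Discard the overlapping part of the output.
            keep = block[(r0 - wr0) * scale:(r1 - wr0) * scale,
                         (c0 - wc0) * scale:(c1 - wc0) * scale]
            out[r0 * scale:r1 * scale, c0 * scale:c1 * scale] = keep
    return out

field = np.random.default_rng(0).normal(size=(10, 7))
chunked = synthesize_in_chunks(field, toy_generator)
whole = toy_generator(field)
```

For this toy generator the block-wise result matches the single-pass result, since the discarded overlap covers the full receptive field of the kept pixels.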
Our texture synthesis framework can fully automatically generate plausible results. However, as with most texture synthesis scenarios, user control and interactive editing are highly desirable. We have developed an interactive tool with different editing operations for our GAN-based image synthesis approach, see Fig. 5.
We provide two major sets of editing operations: (1) directly manipulating the latent field, (2) editing the guidance map.
The first editing operation for manipulating the latent field allows the user to drag and drop a tile from a list of clusters (Fig. 5, bottom) onto an existing latent field. The inserted latent tile is randomly sampled from the latent tiles belonging to the selected cluster. In order to visualize the expected tile appearance, we show the user a representative image corresponding to the cluster centers. This tool can be generalized as a GAN-based paintbrush of variable size, which offers a high degree of user control. The second editing operation is a cloning tool, where we take parts of existing content from the synthesized image and clone the respective latents onto other regions. We provide an option to spatially shuffle the cloned tiles to add more diversity to the cloned region. Moreover, we can add small amounts of noise or interpolate between two latent tiles to allow for additional degrees of variability. These simple latent manipulation tools provide local control of the resulting output.
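The latent editing operations reduce to simple array manipulations, as the following minimal sketch shows. It assumes the latent field is stored as a height x width x channels array and that interpolation is linear; the function names are hypothetical and do not refer to the interactive tool's API.

```python
import numpy as np

def paint_tile(field, r, c, tile):
    """Write a latent tile into the field with its top-left corner at (r, c)."""
    h, w = tile.shape[:2]
    field[r:r + h, c:c + w] = tile

def clone_region(field, src, dst, size):
    """Copy a size x size block of latents from src to dst (top-left corners)."""
    (sr, sc), (dr, dc) = src, dst
    field[dr:dr + size, dc:dc + size] = field[sr:sr + size, sc:sc + size].copy()

def interpolate(tile_a, tile_b, alpha):
    """Linear blend between two latent tiles; alpha=0 gives tile_a."""
    return (1.0 - alpha) * tile_a + alpha * tile_b

def add_noise(tile, sigma, rng):
    """Perturb a latent tile with small Gaussian noise for extra variability."""
    return tile + rng.normal(scale=sigma, size=tile.shape)

rng = np.random.default_rng(0)
field = np.zeros((8, 8, 4))
brush = np.ones((2, 2, 4))

paint_tile(field, 2, 4, brush)                          # latent paintbrush
clone_region(field, src=(2, 4), dst=(6, 0), size=2)     # cloning tool
blend = interpolate(np.zeros((2, 2, 4)), brush, alpha=0.25)
noisy = add_noise(brush, sigma=0.01, rng=rng)
```

After any of these edits, only the affected part of the latent field needs to be passed through the generator again, which is what keeps the editing interactive.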
Finally, the appearance of the output texture can be influenced by modifying the guidance map using traditional image manipulation techniques. This capability provides virtually limitless variations in the size of the output, shading, placement of features, etc.
Our GAN architecture and training is based on the approach of Karras et al. (2018a), called Progressive Growing of GANs (ProGAN). We have slightly modified the generator architecture to extract the intermediate latents at any arbitrary layer of the GAN. These latent tiles can be modified and merged to generate large-scale non-homogeneous textures. We have chosen ProGAN over other GAN architectures because it consistently produces high-quality output images and because the architecture consisting of a stack of identical building blocks facilitates the division into the two parts that allow us to manipulate the intermediate latent field.
While the training process may take on the order of days on multiple GPUs to reach tiles of acceptable quality, the synthesis and editing steps of the texture generation are possible at interactive rates on a machine with a single GPU. The preprocessing step is done only once and typically requires 30 minutes to sample a set of latent tiles and then cluster the corresponding representative images. For each data set, we train the GAN on four NVIDIA V100 GPUs for around 4 days. We use the default optimizer and training schedule provided by the official TensorFlow (Abadi et al., 2015) implementation from Karras et al. (2018a), available at github.com/tkarras/progressive_growing_of_gans.
We have compiled various training data sets for experimentation with our method. Popular existing data sets like CelebA or LSUN are not suitable for large-scale texture synthesis. Therefore, we have curated our own test data from several sources of publicly available large-scale imagery, all of which are processed as fixed-size image tiles:
Terrain map. We collect tiles of the terrain basemap provided by Google Maps.
Satellite imagery. We use samples from the tiles of Landsat satellite images.
Oil canvas. We also consider high-resolution images of smaller objects, including tiles from the detailed Gigapixel image of Vincent van Gogh’s The Starry Night provided by the Google Art Project.
Night sky. We sample high-resolution tiles from the European Southern Observatory and from the Hubble Space Telescope image repository.
For all data sets, we do not perform any alignment steps or augmentation of the input training tiles.
In this section, we present a qualitative and quantitative analysis of the results created using our method. All synthesis results are generated using our interactive tool written in Python and running on a desktop machine equipped with an Intel Xeon 3.00GHz CPU with 32GB RAM and a single NVIDIA TITAN Xp GPU with 12GB memory. Note that in order to handle results with hundreds of megapixels, we cache the latent field and synthesize the image one block at a time, using the largest block that fits on the GPU.
We use our framework on the data sets described in Sec. 6 and showcase selected results in Fig. 9 and Fig. 10 to demonstrate the quality and variability afforded by our method. Tab. 1 shows the corresponding statistics for each result. We argue that our method can generate large-scale non-homogeneous textures with high visual quality that surpasses the quality achieved by any other published method.
We compare our method to two other state-of-the-art algorithms. We selected self-tuning texture optimization (STTO) (Kaspar et al., 2015), which we believe to be the state-of-the-art texture synthesis algorithm not using neural networks, and non-stationary texture synthesis (NSTS) (Zhou et al., 2018), a recent neural network-based algorithm. We take a single texture tile as input for the STTO and NSTS algorithms. For both techniques, we used the recommended settings provided by the authors. To generate our results, we use the fully trained network and apply the input image as a guidance map. We present a comparison of the different methods in Fig. 6. Even though we could verify that STTO generates excellent results for a large variety of textures, in these tests we can see that STTO is unable to handle the multi-scale features present in the aerial image and produces a highly repetitive output. This demonstrates that challenging issues related to multi-scale texture synthesis have not been fully explored. We also verified our installation of the NSTS code released by the authors by replicating the results in their paper before running it on aerial images. However, NSTS is also not able to produce high-quality results when taking aerial images as input. We believe that multi-scale texture synthesis requires a large amount of input and that previous work is inherently not suitable to generate multi-scale content of high quality. By contrast, we can generate images of high visual quality with few artifacts.
We also provide a comparison of the running time and scale for the synthesis of various examples and methods (see Table 1). As shown in the table, self-tuning texture optimization does not scale as well with respect to the input data size and the output data size as our method. Non-stationary texture synthesis spends a lot of time on analyzing a small input texture, but this is not a suitable strategy for large-scale and multi-scale texture synthesis. Our method is faster than the competing methods if we exclude our preprocessing times with the justification that the preprocessing time would be amortized over many applications of a single trained GAN generator.
Our framework is able to generate high-resolution textures on multiple data sets. However, there are still several limitations to be considered for future work.
The largest output that we are able to generate on our testing machine is about 1.6 Gigapixels. Synthesizing and viewing larger images would require the implementation of additional memory management procedures.
Not all latent tiles generate good looking results, see Fig. 8 for a failure example where our result contains visual artifacts. Such failures might be attributed to training, sampling of the data set, or inherent limitations of the chosen GAN implementation. Since our framework is fairly modular, we believe that we can integrate new GANs easily.
In addition, the blending of two tiles can lead to unpredictable results. For example, two forest tiles next to each other can generate a visible boundary of different color in the transition region that was not visible in either tile when generated by itself. Fig. 7 depicts this failure case, where a merging of apparently visually similar tiles (such as the depicted mixture of arctic ice and continental ice) generates undesirable transition artifacts when attempting to merge them. Our framework tries to minimize these unpredictable results at the refinement step, but it is not able to eliminate them completely. Furthermore, such refinement may lead to decreased diversity due to increased repeated similar tiles if the MRF optimization is run till convergence (Kaspar et al., 2015). To help avoid outputs with repeated visually similar latent tiles we stop the optimization early. When matching features during the optimization, visual artifacts occurring by directly selecting the highest ranked feature could be reduced by incorporating implicit diversity in the MRF as a regularization loss (Wang et al., 2018b). Alternative strategies to the MRF optimization considering latent tile usage (Jamriška et al., 2015) or incorporating diversity encouraging feature distance metrics (Mechrez et al., 2018) are worth exploring in future work.
Our results can occasionally exhibit grid-like artifacts, as illustrated in Figure 8. This undesirable effect is especially noticeable when initializing a grid tiling completely randomly or when tiling challenging regions where the MRF doesn’t manage to select good fitting neighbors and the edge transitions become overly visible. The tiling may also exhibit some repetitiveness when the guidance map contains large regions of uniform color, which are tiled by very similar latent tiles, as visible in the background of Figure 5.
In this work, we have tackled the problem of texture synthesis in the setting where many input images are given and a large-scale output is required. We have built on recent advances in high-quality generative adversarial networks and proposed a fast algorithm to tile outputs of GANs to produce large plausible texture maps with virtually no boundary artifacts. We have also proposed an interface that enables local and global artistic control on the output image. Our early quantitative and qualitative results demonstrate the fast generation of high-quality textures consisting of hundreds of megapixels. As far as we know, our work is the first to attempt to seamlessly combine intermediate latent tiles at different levels of a GAN to interactively generate such large texture synthesis results.
One interesting avenue for future work is to experiment on datasets from other celestial bodies (e.g., Mars, Pluto, Sun) and close-ups of everyday objects captured at Gigapixel levels. We are also interested in applying our technique to data with depth information or multiple channels. Another avenue for future work is adopting a stacking of multi-layer GANs in order to generate more realistic guide maps that could themselves be created by an upper-layer GAN. Furthermore, the idea of simply manipulating a latent field to produce large textures can be exploited to quickly and consistently modify global appearance, including applying color transformations or global patterns. We believe that the idea of latent vector manipulation can lead to many innovations in the future of texture synthesis.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. tensorflow.org.
Texture Synthesis Using Convolutional Neural Networks. In Advances in Neural Information Processing Systems 28.
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).