Transposer: Universal Texture Synthesis Using Feature Maps as Transposed Convolution Filter

07/14/2020 ∙ by Guilin Liu, et al. ∙ 2

Conventional CNNs for texture synthesis consist of a sequence of (de)-convolution and up/down-sampling layers, where each layer operates locally and lacks the ability to capture the long-term structural dependency required by texture synthesis. Thus, they often simply enlarge the input texture, rather than perform reasonable synthesis. As a compromise, many recent methods sacrifice generalizability by training and testing on the same single (or fixed set of) texture image(s), resulting in huge re-training time costs for unseen images. In this work, based on the discovery that the assembling/stitching operation in traditional texture synthesis is analogous to a transposed convolution operation, we propose a novel way of using the transposed convolution operation. Specifically, we directly treat the whole encoded feature map of the input texture as transposed convolution filters and the features' self-similarity map, which captures the auto-correlation information, as input to the transposed convolution. Such a design allows our framework, once trained, to generalize to unseen textures and synthesize them with a single forward pass in nearly real-time. Our method achieves state-of-the-art texture synthesis quality based on various metrics. While self-similarity helps preserve the input textures' regular structural patterns, our framework can also take random noise maps, instead of self-similarity maps, as transposed convolution inputs for irregular input textures. This yields more diverse results and allows arbitrarily large texture outputs to be generated in a single pass by directly sampling large noise maps.


1. Introduction

Texture synthesis is defined as the problem of generating a large image output given a small example input such that the visual features and structures are preserved both locally and globally. Many methods have been explored in the past two decades including pixel-based methods (Efros and Leung, 1999), assembling based methods (Efros and Freeman, 2001; Kwatra et al., 2003), optimization based methods (Kwatra et al., 2005; Kaspar et al., 2015), etc.

Inspired by the unprecedented success of deep learning in computer vision, others have explored deep learning methods for texture synthesis. Existing works fall into one of two categories. Either an optimization procedure is used to match deep feature statistics in a pre-trained network (Gatys et al., 2015b; Li et al., 2017b), resulting in a slow generation process; or a network is trained to overfit on a fixed image or set of images (Li et al., 2017a; Zhou et al., 2018; Shaham et al., 2019), which prevents it from generalizing to unseen textures and requires lengthy re-training for every unseen texture image.

One reason for the poor generalization ability of the aforementioned methods (Li et al., 2017a; Zhou et al., 2018; Shaham et al., 2019) is that these one-model-per-image (set) approaches usually employ conventional image-to-image translation networks, which first embed the input into a feature space and then fully rely on a sequence of upsampling and convolutional layers to reach the target output size. Each upsampling and convolutional layer is a local operation lacking global awareness. This design works well for tasks such as image super resolution, where the task is to enhance or modify local details. However, texture synthesis differs from super resolution in that texture synthesis, when viewed from a classical perspective, involves displacing and assembling copies of the input texture using different optimal offsets in a seamless way. The optimal displacement and assembling strategy involves much longer-range operations and compatibility checking, which are not easy to model with a conventional design that relies entirely on a sequence of local up/down-sampling and (de)convolutional layers.

In the column pix2pixHD of Figure 1, we show that a conventional image-to-image translation network adapted from pix2pixHD (Wang et al., 2018) fails to perform reasonable texture synthesis, but instead mostly just enlarges the local contents for the input textures even though it has been trained to convergence using the same input and output pairs as our method.

In this paper, we propose a new deep learning based texture synthesis framework that generalizes to arbitrary unseen textures and synthesizes larger-size outputs. From a classical view, the texture synthesis task can also be interpreted as the problem of first finding an appropriate offset at which to place a copy of the input texture image and then using an optimization technique to find the optimal seam between this newly placed copy and the existing image in order to assemble them. Our method follows a similar spirit but differs in the following major ways: 1) We perform assembling in feature space and at multiple scales; 2) The optimal shifting offset and assembling weights are modeled with the help of a score map, which captures the similarity and correlation between different regions of the encoded texture image. We call this score map a self-similarity map (discussed in detail in Section 3); 3) We later show that the shifting and assembling operations can be efficiently performed with a single forward pass of a transposed convolution operation (Dumoulin and Visin, 2016), where we directly use the encoded features of the input texture as transposed convolution filters, and the self-similarity map as the transposed convolution input. Unlike traditional transposed convolution, our transposed convolution filters are not learnable parameters. While the self-similarity map plays a key role in preserving regular structural patterns, our framework can alternatively take a random noise map as input instead of the self-similarity map. For textures with irregular structure, this generates diverse outputs, and sampling a correspondingly large noise map produces arbitrarily large texture outputs in a single shot.

In this work, we make the following contributions: 1) We present a generalizable texture synthesis framework that performs faithful synthesis on unseen texture images in nearly real time with a single forward pass. 2) We propose a self-similarity map that captures the similarity and correlation information between different regions of a given texture image. 3) We show that the shifting and assembling operations in traditional texture synthesis methods can be efficiently implemented using a transposed convolution operation. 4) We achieve state-of-the-art texture synthesis quality as measured by existing image metrics, by metrics designed specifically for texture synthesis, and in a user study. 5) We show that our framework is also able to generate diverse and arbitrarily large texture synthesis results by sampling random noise maps.

2. Related Work

We provide a brief overview of existing texture synthesis methods. A complete survey, which is beyond the scope of this work, can be found in (Wei et al., 2009).

Non-parametric Texture Synthesis. Existing texture synthesis methods include pixel-based methods (Efros and Leung, 1999; Wei and Levoy, 2000), assembling based methods (Efros and Freeman, 2001; Liang et al., 2001; Kwatra et al., 2003; Pritch et al., ), optimization based methods (Portilla and Simoncelli, 2000; Kwatra et al., 2005; Rosenberger et al., 2009; Kaspar et al., 2015), appearance space synthesis (Lefebvre and Hoppe, 2006), etc. There are also some other works (Hertzmann et al., 2001; Zhang et al., 2003; Wu and Yu, 2004; Lefebvre and Hoppe, 2006; Rosenberger et al., 2009; Wu et al., 2013) showing interesting synthesis results; however, they usually need additional user manual inputs.

Among these traditional methods, self-tuning texture optimization (Kaspar et al., 2015) is the current state-of-the-art method. It uses image melding (Darabi et al., 2012) with automatically generated and weighted guidance channels, which helps to reconstruct the middle-scale structures in the input texture. Our method is motivated by assembling based methods. (Kwatra et al., 2003) is a representative method of this kind, where texture synthesis is formulated as a graph cut problem. The optimal offset for displacing the input patch and the optimal cut between the patches can be found by solving the graph cut objective function, which sometimes could be slow.

Deep Feature Matching-based Texture Synthesis. Traditional optimization based methods (Portilla and Simoncelli, 2000; Kwatra et al., 2005; Rosenberger et al., 2009; Kaspar et al., 2015) rely on matching the global statistics of hand-crafted features defined on the input and output textures. Recently, some deep neural network based methods have been proposed as a way to use the features learned from natural image priors to guide the optimization procedure. Gatys et al. (Gatys et al., 2015b) define the optimization procedure as minimizing the difference in Gram matrices of the deep features between the input and output texture images. Sendik et al. (Sendik and Cohen-Or, 2017) and Liu et al. (Liu et al., 2016) modify the loss proposed in (Gatys et al., 2015b) by adding a structural energy term and a spectrum constraint, respectively, to generate structured and regular textures. However, in all cases, these optimization-based methods are prohibitively slow due to the iterative optimization.

Learning-based Texture Synthesis. Johnson et al. (Johnson et al., 2016) and Ulyanov et al. (Ulyanov et al., 2016) alleviate the previously mentioned optimization problem by training a neural network to directly generate the output, using the same loss as in (Gatys et al., 2015b). This setup moves the computational burden to training time, resulting in faster inference. However, the learned network can only synthesize the texture it was trained on and cannot generalize to new textures.

A more recent line of work (Zhou et al., 2018; Shaham et al., 2019; Li and Wand, 2016; Li et al., 2017a; Frühstück et al., 2019; Jetchev et al., 2016; Bergmann et al., 2017; Alanov et al., 2019) has proposed using Generative Adversarial Networks (GANs) for more realistic texture synthesis while still suffering from the inability to generalize to new unseen textures.

Zhou et al. (Zhou et al., 2018) learn a generator network that expands texture blocks into output through a combination of adversarial, L1, and style (Gram matrix) losses. Li et al. and Shaham et al. (Li and Wand, 2016; Shaham et al., 2019) use a special discriminator that examines statistics of local patches in feature space. However, even these approaches can only synthesize the single texture on which they have been trained.

Other efforts (Li et al., 2017a; Jetchev et al., 2016; Bergmann et al., 2017; Alanov et al., 2019; Frühstück et al., 2019) try to train on a set of texture images. During test time, the texture being generated is either chosen by the network (Jetchev et al., 2016; Bergmann et al., 2017) or user-controlled (Li et al., 2017a; Alanov et al., 2019). (Frühstück et al., 2019) propose a non-parametric method to synthesize large-scale, varied outputs by combining intermediate feature maps. However, these approaches limit generation to textures available in the training set, and thus are unable to produce unseen textures out of the training set.

Li et al. (Li et al., 2017b) apply a novel whitening and coloring transform to an encoder-decoder architecture, allowing them to generalize to unseen textures, but the method relies on an internal SVD, which is slow. Additionally, it can only output texture images with the same size as the input.

Yu et al. (Yu et al., 2019) perform interpolation between two or more source textures. While forcing the source textures to be identical converts the method to the texture synthesis setting, doing so reduces the framework to something closer to a conventional CNN. Besides likely suffering from the issues of conventional CNNs, its main operations of per-feature-entry shuffling, retiling and blending would largely break the regular or large structural patterns in the input.

Other Image-to-Image Tasks. GANs (Goodfellow et al., 2014) have also been used in other image-to-image tasks (Isola et al., 2016; Wang et al., 2018; Dundar et al., 2020). Ledig et al. (Ledig et al., 2016) used them to tackle the problem of super-resolution, where detail is added to a low-resolution image to produce high-definition output. In contrast to these tasks, the texture synthesis problem is to synthesize new, varied regions similar to the input, and not to provide more details to an existing layout as in (Ledig et al., 2016) or translate the texture to a related domain as in (Isola et al., 2016). Other recipes, such as converting texture synthesis to an image inpainting problem, usually cannot produce satisfying results because they cannot handle the large holes where the synthesis is needed.

Similarity Map. Our framework relies on computing the self-similarity map, which is similar in spirit to the deep correlation formulation in (Sendik and Cohen-Or, 2017). The difference is that (Sendik and Cohen-Or, 2017) computes a dot product map between the feature map and its spatially shifted version and uses it as a regularizing term in their optimization objective; in contrast, we aggregate all the channels’ information to compute a single-channel difference similarity map and use it to model the optimal synthesis scheme in the network with a single pass.

3. Our Approach

Figure 2. (a) shows how the self-similarity map is computed. (b) shows how to perform the transposed convolution operation. Both (a) and (b) are the components used in our overall framework (c), shown with green and blue colors, respectively. Full animation of (a) and (b) can be found in the supplementary video. In (c), yellow boxes represent the features in the encoder. The encoded features in the last three scales are first used to compute the self-similarity maps, as shown in (a). We then perform the transposed convolution operation as shown in (b), where encoded features are used as transposed convolution filters to convolve the self-similarity maps. The convolved outputs are then used in the decoder to generate the final image.

Problem Definition: Given an input texture patch, we want to expand the input texture to a larger output whose local pattern resembles the input texture pattern. Our approach to this problem is similar in spirit to the traditional assembling based methods, which try to find the optimal displacements of copies of the input texture, as well as the corresponding assembly scheme, to produce a large, realistic texture image output. We first formulate the texture expansion problem as a weighted linear combination of displaced deep features at various shifting positions, and then discuss how to use the transposed convolution operation to address it.

Deep Feature Expansion: Let $F \in \mathbb{R}^{C \times H \times W}$ be the deep features of an input texture patch, with $C$, $H$ and $W$ being the number of channels, the height, and width, respectively. We create a spatially expanded feature map, for instance by a factor of 2, by simply pasting and accumulating $F$ into a $C \times 2H \times 2W$ space. This is done by shifting $F$ along the width axis with a progressive step ranging from 0 to $W$, as well as along the height axis with a step ranging from 0 to $H$. All the shifted maps are then aggregated together to give us an expanded feature map $G \in \mathbb{R}^{C \times 2H \times 2W}$.

To aggregate the shifted copies of the feature map $F$, we compute a weighted sum of them. For instance, to calculate the feature $G(:, i, j)$ of the expanded map $G$, we aggregate all possible shifted copies of $F$ that fall in the spatial location $(i, j)$. While previous approaches rely on hand-crafted rules or other heuristics to aggregate the overlapping features, we propose to weight each shifted feature map with a similarity score that quantifies the semantic distance between the original $F$ and its shifted copy. Finally, aggregation is done by simple summation of the weighted features. Mathematically, $G$ can be given by

$$G(:, i, j) = \sum_{p, q} s(p, q)\, F^{p,q}(:, i, j) \tag{1}$$

where $(i, j)$ indexes the expanded grid, the sum runs over all shifts $(p, q)$, $s(p, q)$ is the similarity score of the $(p, q)$-shifting, and $F^{p,q}$ is the projection of $F$'s $(p, q)$-shifted copy on the expanded grid. Namely, $F^{p,q}$ equals the shifted copy of $F$ at the locations that copy covers and is zero elsewhere.

We compute the similarity score $s(p, q)$ of the current $(p, q)$-shifting using the overlapping region based on the following equation:

$$s(p, q) = -\frac{\lVert F_{\Omega} - F^{p,q}_{\Omega} \rVert_2}{\lVert F_{\Omega} \rVert_2} \tag{2}$$

Here, $F_{\Omega}$ and $F^{p,q}_{\Omega}$ indicate the overlapping region $\Omega$ between the original copy of the feature map $F$ and the current $(p, q)$-shifted copy, and $\lVert F_{\Omega} \rVert_2$ is the L2 norm of $F_{\Omega}$. The denominator is used for normalization such that the scale of $s(p, q)$ is independent of the scale of $F$. Figure 2(a) shows how the self-similarity score is computed at one shift. Note that the self-similarity map is not symmetric with respect to its center, as the denominator of Equation 2 is not symmetric with respect to the center. Full animation of computing the self-similarity map can be found in the supplementary video. We apply a simple transformation to $s$ before using it in Equation 1, specifically one convolutional layer and one activation layer in our implementation.

Figure 3. Self-similarity Maps. We show the input texture images and the visualization of their self-similarity maps at 3 different scales. The first texture image exhibits more obvious self-similarity patterns at the second scale, while the other three texture images exhibit more obvious self-similarity patterns at the first scale.

As shown in Equation 2, the similarity score for a shift of $(p, q)$ along the width and height axis, respectively, is calculated as the L2 distance between the un-shifted and shifted copies of the feature map, normalized by the norm of the un-shifted copy's overlapping region. So, the zero shift gives the maximum score because there is no shifting and the copy exactly matches the original. Computing self-similarity maps can be efficiently implemented with the help of existing convolution operations. Details are discussed in the supplementary file.
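The score described above can be sketched with a direct double loop in NumPy. This is a minimal illustration rather than the authors' convolution-based implementation: the function name, the shift range of half the feature size in each direction, the `(dy, dx)` index layout, the negation of the distance (so that the unshifted copy scores highest), and the small `eps` guard are all assumptions of the sketch.

```python
import numpy as np

def self_similarity_map(feat, eps=1e-8):
    """Score every displacement of `feat` (shape (C, H, W)) against itself.

    Returns an array of shape (2 * (H // 2) + 1, 2 * (W // 2) + 1), i.e.
    (H + 1, W + 1) for even H and W. Entry [dy + H // 2, dx + W // 2] is the
    score of displacing a copy by (dy, dx); this layout is an assumption.
    """
    C, H, W = feat.shape
    half_h, half_w = H // 2, W // 2
    scores = np.zeros((2 * half_h + 1, 2 * half_w + 1))
    for dy in range(-half_h, half_h + 1):
        for dx in range(-half_w, half_w + 1):
            # Part of the original copy that the displaced copy covers.
            orig = feat[:, max(0, dy):min(H, H + dy), max(0, dx):min(W, W + dx)]
            # The same region expressed in the displaced copy's coordinates.
            shifted = feat[:, max(0, -dy):min(H, H - dy), max(0, -dx):min(W, W - dx)]
            # Negated L2 distance, normalized by the original copy's overlap.
            dist = np.linalg.norm(orig - shifted)
            scores[dy + half_h, dx + half_w] = -dist / (np.linalg.norm(orig) + eps)
    return scores
```

For a feature map that repeats exactly with some period, the displacement equal to that period scores as high as the zero displacement, which is the behaviour the self-similarity map is meant to expose.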

We compute the self-similarity maps at multiple scales. Different texture images may exhibit more obvious self-similarity patterns on a specific scale than other scales, as shown in Figure 3.

Feature (Texture) Expansion via Transposed Convolution Operation: Note that the process of pasting shifted feature maps and aggregating them to create larger feature maps is equivalent to the operation of a standard transposed convolution in deep neural networks. For a given filter and input data, a transposed convolution operation simply copies the filter, weighted by the respective center entry's value in the input data, into a larger output grid and sums the copies. In fact, our proposed Equation 1 has the form of a transposed convolution. Specifically, we apply transposed convolutions with a stride of 1, treating the $C \times H \times W$ feature map $F$ as the transposed convolution filter, and the similarity map $s$, given by Equation 2, as the input to the transposed convolution. This results in an output feature map $G$ of size $C \times 2H \times 2W$. Figure 2(b) shows how the transposed convolution is done using the encoded input texture as filters and the first entry in the self-similarity map as input. Full animation of the transposed convolution operation can be found in the supplementary video.
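The paste-and-accumulate view of this operation can be written out explicitly. The sketch below is an illustration in NumPy, not the paper's code: `expand_by_pasting` is a hypothetical name, the weight map is taken as a plain 2-D array with one weight per paste offset, and unit stride is assumed.

```python
import numpy as np

def expand_by_pasting(feat, weights):
    """Paste weighted copies of `feat` at every offset and sum them.

    feat:    (C, H, W) feature map used as the "filter".
    weights: (Sh, Sw) map with one weight per paste offset (the "input").
    Returns an array of shape (C, H + Sh - 1, W + Sw - 1).
    """
    C, H, W = feat.shape
    Sh, Sw = weights.shape
    out = np.zeros((C, H + Sh - 1, W + Sw - 1))
    for y in range(Sh):
        for x in range(Sw):
            # Copy the whole feature map to offset (y, x), scaled by its weight.
            out[:, y:y + H, x:x + W] += weights[y, x] * feat
    return out

# Example: a (H + 1, W + 1) weight map doubles the spatial size.
feat = np.random.default_rng(0).normal(size=(3, 4, 4))
weights = np.zeros((5, 5))
weights[2, 1] = 2.0          # a single active offset pastes one scaled copy
expanded = expand_by_pasting(feat, weights)
```

In PyTorch, the same result would be expected from `torch.nn.functional.conv_transpose2d` with the weight map as a `(1, 1, Sh, Sw)` input and the feature map as a `(1, C, H, W)` weight tensor at stride 1, which avoids the Python loops.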

3.1. Architecture

Figure 2(c) illustrates our overall texture synthesis framework. It relies on a UNet-like architecture. The encoder extracts deep features of the input texture patch at several scales. We then apply our proposed transposed convolution-based feature map expansion technique at each scale. The resulting expanded feature map is then passed onto a standard decoder layer. Our network is fully differentiable, allowing us to train our model with stochastic gradient-based optimizations in an end-to-end manner. The four main components of our framework in Figure 2(c) are:

  1. Encoder: Learns to encode the input texture image into deep features at different scales or levels.

  2. Self-Similarity Map Generation: Constructs guidance maps from the encoded features to weight the shifted feature maps in the shift, paste and aggregate process of feature map expansion (see Equation 2 and Figure 2(a)).

  3. Transposed Convolution Operation: Applies spatially varying transposed convolution operations, treating the encoded feature maps directly as filters and the self-similarity maps as inputs to produce expanded feature maps, as shown in Figure 2(b). Note that, unlike traditional transposed convolution layers, our transposed convolution filters are not learnable parameters. More details about the difference between our transposed convolution operation and a traditional transposed convolution layer can be found in the supplemental file.

  4. Decoder: Given the already expanded features from the transposed convolution operations at different scales, we follow the traditional decoder network design that uses standard convolutional layers followed by bilinear upsampling layers to aggregate features at different scales, and generate the final output texture, as shown in the last row of Figure 2(c).

As described above, our proposed texture expansion technique is performed at multiple feature representation levels, allowing us to capture both diverse features and their optimal aggregation weights. Unlike previous approaches that rely on heuristics or graph-based techniques to identify the optimal overlap of shifted textures, our approach formulates the problem as a direct generation of larger texture images conditioned on optimally assembled deep features at multiple scales. This makes our approach desirable as it is data-driven and generalizable to various textures.

3.2. Loss Functions

Figure 4. Ablation study on the components of the loss function. Columns, left to right: input, no perceptual loss, no style loss, no GAN loss, full loss.

During training, given a random image of size $(2H, 2W)$, denoted as $I_{target}$, its center crop of size $(H, W)$, denoted as $I_{input}$, is the input to the network. We train the network to predict an output image $I_{out}$ of size $(2H, 2W)$. A VGG-based perceptual loss, a style loss and a GAN loss are all used to train the network. The perceptual loss and style loss are defined between $I_{out}$ and $I_{target}$ at the full resolution of $(2H, 2W)$; meanwhile, the GAN loss is defined on random crops at the resolution of $(H, W)$. Details are discussed below.

VGG-based perceptual loss and style loss. Perceptual loss and style loss are defined following Gatys et al. (Gatys et al., 2015a).

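The perceptual loss and style loss are defined as:

$$\mathcal{L}_{perceptual} = \sum_{p=0}^{P-1} \frac{\lVert \Psi_p^{I_{out}} - \Psi_p^{I_{target}} \rVert_1}{N_{\Psi_p^{I_{target}}}} \tag{3}$$

$$\mathcal{L}_{style} = \sum_{p=0}^{P-1} \frac{1}{C_p C_p} \left\lVert K_p \left( (\Psi_p^{I_{out}})^{\top} \Psi_p^{I_{out}} - (\Psi_p^{I_{target}})^{\top} \Psi_p^{I_{target}} \right) \right\rVert_1 \tag{4}$$

Here, $I_{out}$ is the network output, $I_{target}$ is the ground-truth image, and $N_{\Psi_p^{I_{target}}}$ is the number of entries in $\Psi_p^{I_{target}}$. The perceptual loss computes the L1 distances between $I_{out}$ and $I_{target}$, but after projecting these images into higher-level feature spaces using an ImageNet-pretrained VGG-19 (Simonyan and Zisserman, 2014). $\Psi_p^{I}$ is the activation map of the $p$-th selected layer given the original input $I$. We use features from the 2nd, 7th, 12th, 21st and 30th layers, corresponding to the output of the ReLU layers at each scale. In Equation (4), the matrix operations assume that the high-level feature $\Psi_p^{I}$ is of shape $(H_p W_p) \times C_p$, resulting in a $C_p \times C_p$ Gram matrix, and $K_p$ is the normalization factor for the $p$-th selected layer.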
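Both losses can be sketched on precomputed activation maps. The following is an illustrative NumPy version, not the training code: the function names are hypothetical, the activations are passed in as lists of `(C, H, W)` arrays (one per selected layer) rather than extracted from VGG-19, and the use of L1 distances together with a Gram normalization of `1 / (C * H * W)` are assumptions of this sketch.

```python
import numpy as np

def perceptual_loss(feats_out, feats_tgt):
    """Sum over layers of the mean absolute difference between activations."""
    return sum(np.abs(a - b).sum() / b.size for a, b in zip(feats_out, feats_tgt))

def gram(feat):
    """(C, H, W) activations -> (C, C) Gram matrix, scaled by 1 / (C * H * W)."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)
    return flat @ flat.T / (C * H * W)

def style_loss(feats_out, feats_tgt):
    """Sum over layers of the L1 Gram-matrix difference, averaged over C * C."""
    total = 0.0
    for a, b in zip(feats_out, feats_tgt):
        C = a.shape[0]
        total += np.abs(gram(a) - gram(b)).sum() / (C * C)
    return total
```

Because the Gram matrix discards spatial positions, the style term compares channel co-occurrence statistics, while the perceptual term compares activations location by location.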

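GAN loss. The discriminator takes as input the concatenation of the input patch $I_{input}$ and a random crop of size $(H, W)$ from either the network output $I_{out}$ or the ground truth $I_{target}$. Denote the random crop from $I_{out}$ as $I_{out}^{crop}$ and the random crop from $I_{target}$ as $I_{target}^{crop}$. The intuition behind the concatenation is to let the discriminator learn to classify whether $I_{input}$ and the crop form a pair of two similar texture patches or not. We randomly crop 10 times from both $I_{out}$ and $I_{target}$ and sum up the losses.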
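The pairing fed to the discriminator can be sketched as follows. This only builds the concatenated inputs; the discriminator and the adversarial objective are omitted. The function names and the channel-first `(C, H, W)` layout are assumptions of the sketch.

```python
import numpy as np

def random_crop(img, h, w, rng):
    """Return a random (C, h, w) window of a (C, H, W) image."""
    _, H, W = img.shape
    top = rng.integers(0, H - h + 1)    # upper bound is exclusive
    left = rng.integers(0, W - w + 1)
    return img[:, top:top + h, left:left + w]

def discriminator_inputs(inp, full, n_crops=10, rng=None):
    """Concatenate the input patch with `n_crops` random crops of `full`.

    inp:  (C, h, w) input texture patch.
    full: (C, H, W) larger image, either the network output or the ground truth.
    Returns a list of (2C, h, w) arrays, one per crop.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = inp.shape
    return [np.concatenate([inp, random_crop(full, h, w, rng)], axis=0)
            for _ in range(n_crops)]
```

Calling `discriminator_inputs` once with the network output and once with the ground truth would give the fake and real batches whose losses are then summed.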

Ablation Study. These 3 losses are summed up with the weights of 0.05, 120 and 0.2 respectively. We find that all of them are useful and necessary. As shown in Figure 4, without perceptual loss, the result just looks like the naive tiling of the inputs; no style loss makes the border region blurry; and no GAN loss leads to obvious discrepancy between the center region and the border region.
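As a small worked example of the weighting above, the combined objective can be written as a weighted sum. The sketch assumes the three weights apply to the perceptual, style and GAN terms in that order, which is the order in which the losses are introduced; the names are hypothetical.

```python
# Weights quoted in the text; the mapping to loss terms follows the order
# in which the losses are introduced (an assumption of this sketch).
LOSS_WEIGHTS = {"perceptual": 0.05, "style": 120.0, "gan": 0.2}

def total_loss(perceptual, style, gan, weights=LOSS_WEIGHTS):
    """Weighted sum of the three scalar loss terms."""
    return (weights["perceptual"] * perceptual
            + weights["style"] * style
            + weights["gan"] * gan)
```

The large style weight suggests that the raw style term is numerically much smaller than the other two, so the weights are best read as balancing magnitudes rather than ranking importance.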

4. Experiments and Comparisons

Figure 5. Results of different approaches on 128×128 to 256×256 texture synthesis. Columns, left to right: input, transposer (ours), Self-tuning, pix2pixHD, SinGAN(*), Non-stat.(*), WCT, DeepTexture, Texture Mixer, ground truth. For the SinGAN(*) and Non-stat.(*) results, the first two rows show the results when training with direct access to the exact-size ground truth; the remaining two rows show the results without access to the ground truth. In this paper, unless specified, our results (transposer) use the self-similarity map as the transposed convolution input by default.

4.1. Dataset & Training

To train our network, we collected a large dataset of texture images. We downloaded 55,583 images from 15 different texture image sources (Cimpoi et al., 2014; Sharan et al., 2014; Dai et al., 2014; Burghouts and Geusebroek, 2009; Center for Machine Vision Research, ; Picard et al., 2010; Abdelmounaime and Dong-Chen, 2013; Fritz et al., 2004; Mallikarjuna et al., 2006). The dataset consists of texture images with a wide variety of patterns, scales, and resolutions. We randomly split the dataset to create a training set of 49,583 images, a validation set of 1,000 images, and a test set of 5,000 images. All generation results and evaluation results in the paper are from the test set. When using these images, we resize each one to the target output size to serve as the ground truth, and use its center crop as the input.
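The input/ground-truth pairing described above amounts to a center crop. A minimal sketch, assuming channel-first arrays, an input that is half the target size in each dimension, and hypothetical function names (the resizing step is left out):

```python
import numpy as np

def center_crop(img, h, w):
    """Return the central (h, w) window of an array shaped (..., H, W)."""
    H, W = img.shape[-2:]
    top, left = (H - h) // 2, (W - w) // 2
    return img[..., top:top + h, left:left + w]

def make_training_pair(target):
    """Given a (C, 2H, 2W) ground-truth image, return (input, target),
    where the input is the (C, H, W) center crop."""
    H2, W2 = target.shape[-2:]
    return center_crop(target, H2 // 2, W2 // 2), target
```

With this pairing the network sees only the central quarter of the image and must synthesize the surrounding border region.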

Our network utilizing the transposed convolution operation is implemented using the existing PyTorch interface without custom CUDA kernels. We trained our model on 4 DGX-1 stations with 32 total NVIDIA Tesla V100 GPUs using synchronized batch normalization layers (Ioffe and Szegedy, 2015). For 128×128 to 256×256 synthesis, we use a batch size of 8 and train for 600 epochs. The learning rate is set to 0.0032 at the beginning and is decreased every 150 epochs. For 256×256 to 512×512 synthesis, we fine-tune the model pre-trained for 128×128 to 256×256 synthesis for 200 epochs. While directly using the 128-to-256 pre-trained model generates reasonable results, fine-tuning leads to better quality.

4.2. Baseline & Evaluation Metrics

Method                              Time (256×256)  Time (512×512)  Generalizability  Size-increasing
Self-tuning (Kaspar et al., 2015)   140 s           195 s           Good              Yes
Non-stationary (Zhou et al., 2018)  362 mins        380 mins        No                Yes
SinGAN (Shaham et al., 2019)        45 mins         100 mins        No                Yes
DeepTexture (Gatys et al., 2015b)   13 mins         54 mins         No                No
WCT (Li et al., 2017b)              7 s             14 s            Medium            Yes
pix2pixHD (Wang et al., 2018)       11 ms           22 ms           Medium            Yes
Texture Mixer (Yu et al., 2019)     -               799 ms          Medium            Yes
transposer (ours)                   43 ms           260 ms          Good              Yes

Table 1. Time required for synthesis at different spatial resolutions for various approaches and their corresponding properties. For Non-stationary and SinGAN, the reported time includes training time. All methods are run on one NVIDIA Tesla V100, except for Self-tuning, which runs the default 8 threads in parallel on an Intel Core i7-6800K CPU @ 3.40GHz.

Baselines. We compare against several baselines: 1) Naive tiling, which simply tiles the input four times; 2) Self-tuning (Kaspar et al., 2015), the state-of-the-art optimization-based method; 3) pix2pixHD (Wang et al., 2018), the state-of-the-art image-to-image translation network, to which we add one more upsampling layer to generate an output 2×2 times larger than the input; 4) WCT (Li et al., 2017b), a style transfer method; 5) DeepTexture (Gatys et al., 2015b), an optimization-based method using network features, to which we directly feed the ground truth as input; 6) Texture Mixer (Yu et al., 2019), a texture interpolation method for which we set all interpolation source patches to come from the input texture; 7) Non-stationary (Non-stat.) (Zhou et al., 2018) and SinGAN (Shaham et al., 2019), both of which overfit one model per texture. We train two versions each of Non-stat. and SinGAN: one with direct access to the exact ground truth at the exact target size, and one with access only to the input and not to the target-size ground truth. In the paper, * marks methods that either directly take ground truth images for processing or overfit the model to the ground truth.
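The simplest baseline in this list, naive tiling, is easy to state exactly. A minimal sketch assuming a channel-first `(C, H, W)` array and a hypothetical function name:

```python
import numpy as np

def naive_tiling(img):
    """Tile a (C, H, W) texture 2x2 to get a (C, 2H, 2W) output."""
    return np.tile(img, (1, 2, 2))
```

The output contains four exact copies of the input, so any visible seams between copies come purely from the texture not being periodic at its borders.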

Figure 6. Results of different approaches on 256×256 to 512×512 texture synthesis. Columns, left to right: input, transposer (ours), Self-tuning, pix2pixHD, SinGAN(*), Non-stat.(*), DeepTexture, Texture Mixer, ground truth. For the SinGAN(*) and Non-stat.(*) results, the first two rows show the results when training with direct access to the exact-size ground truth; the remaining two rows show the results without access to the ground truth.

Table 1 shows the runtime and corresponding properties for all the methods. Compared with Self-tuning, our method is much faster. In contrast to Non-stat. and SinGAN, transposer (ours) generalizes better and hence does not require per-image training. Compared with DeepTexture and the style transfer method WCT, our method is still much faster because it needs neither iterative optimization nor an SVD. Even though pix2pixHD is faster than our method, it cannot perform proper texture synthesis, as shown in Figures 5 and 6; the same holds for Texture Mixer (Yu et al., 2019).

Evaluation Metrics. To the best of our knowledge, there is no standard metric to quantitatively evaluate texture synthesis results. We use 3 groups of metrics (6 in total):

  1. Existing metrics include SSIM (Wang et al., 2004), Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) and Fréchet Inception Distance (FID) (Heusel et al., 2017). SSIM and LPIPS are evaluated using image pairs. FID measures the distribution distance between the generated image set and the ground truth image set in feature space.

  2. Crop-based metrics designed for texture synthesis evaluation include crop-based LPIPS (c-LPIPS) and crop-based FID (c-FID). While the original LPIPS and FID are computed on full-size images, c-LPIPS and c-FID operate on crops of images. For c-FID, we crop a set of images from the output image and crop the other set from the ground truth image, and then compute the FID between these two sets (we use a dimension of 64 for c-FID instead of the default 2048 due to a much smaller image set). For c-LPIPS, we compute the LPIPS between the input image and one of the 8 random crops from the output image, and average the scores among the 8 crops.

  3. User Study. Another way to measure the performances of different methods is by performing user study. We thus use Amazon Mechanical Turk (AMT) to evaluate the quality of synthesized textures. We perform AB tests where we provide the user the input texture image and two synthesized images from different methods. We then ask users to choose the one with better quality. Each image is viewed by workers, and the orders are randomized. The obtained preference scores (Pref.) are shown in Table 3, which indicate the portion of workers that prefer our result over the other method.
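The crop-based averaging used for c-LPIPS in item 2 can be sketched independently of the underlying metric. In the sketch below, `metric` is a stand-in callable supplied by the caller (a real LPIPS implementation needs a pretrained network and is not reproduced here); the function name and the channel-first layout are assumptions.

```python
import numpy as np

def crop_based_score(reference, output, metric, n_crops=8, rng=None):
    """Average `metric(reference, crop)` over random crops of `output`.

    reference: (C, h, w) image, e.g. the input texture.
    output:    (C, H, W) synthesized image with H >= h and W >= w.
    metric:    callable taking two (C, h, w) arrays and returning a float.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = reference.shape
    _, H, W = output.shape
    scores = []
    for _ in range(n_crops):
        top = rng.integers(0, H - h + 1)
        left = rng.integers(0, W - w + 1)
        scores.append(metric(reference, output[:, top:top + h, left:left + w]))
    return float(np.mean(scores))

# Stand-in metric for illustration only: mean absolute pixel difference.
def mean_abs_diff(a, b):
    return float(np.abs(a - b).mean())
```

Comparing the input against crops of the output, rather than the full output against the ground truth, rewards results whose local appearance matches the input anywhere in the image.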

Figure 7. More of our results on 4×4 times larger synthesis. Each pair of columns shows the input followed by the transposer (ours) output. Zoom in for more details.

4.3. Comparison Results

4.3.1. Evaluating synthesis of size 128 to 256

Method             SSIM    FID     c-FID   LPIPS   c-LPIPS
Naive tiling       0.311   23.868  0.5959  0.3470  0.2841
Self-tuning        0.3075  33.151  0.5118  0.3641  0.2970
pix2pixHD          0.318   26.800  0.5687  0.3425  0.2833
WCT                0.280   57.630  0.4347  0.3775  0.3226
transposer (ours)  0.386   21.615  0.4763  0.2709  0.2653
Ground Truth       1       0       0.1132  0       0.2670

Table 2. Synthesis scores for different approaches averaged over 5,000 images.

We compare with Self-tuning, pix2pixHD and WCT on the whole test set of 5,000 images and show the quantitative comparisons in Table 2. It is noticeable that our method outperforms Self-tuning and pix2pixHD for all the metrics.

Because Non-stat., SinGAN and DeepTexture are too slow to evaluate on all 5,000 test images, we randomly sampled 200 of the 5,000 test images to evaluate them. The visual comparison is shown in Figure 5. The numerical evaluation results are summarized in Table 3. As shown in the 2nd-8th rows of Table 3, our method significantly outperforms all the methods that do not directly take ground truth as input. When compared with Self-tuning, we achieve a better LPIPS score (0.273 vs. 0.358), and 63% of people prefer the results generated by our method over the ones generated by Self-tuning. The remaining rows of Table 3 also show that our method performs better than the other size-increasing baselines (Non-stat. and SinGAN) and better than or similar to DeepTexture, all of which take ground truth as input. For instance, 51% of people prefer our results over the ground truth and 46% of people prefer our results over DeepTexture, which directly takes the ground truth for its optimization.

Method | SSIM | FID | c-FID | LPIPS | c-LPIPS | Pref.
Naive tiling | 0.289 | 77.54 | 0.552 | 0.349 | 0.287 | -
Self-tuning | 0.296 | 101.75 | 0.464 | 0.358 | 0.292 | 0.63
Non-stat. | 0.321 | 143.31 | 2.728 | 0.3983 | 0.3436 | 0.92
SinGAN | 0.337 | 212.30 | 1.375 | 0.3924 | 0.3245 | 0.81
pix2pixHD | 0.299 | 93.70 | 0.456 | 0.354 | 0.292 | 0.66
WCT | 0.280 | 126.10 | 0.401 | 0.375 | 0.300 | 0.67
Texture Mixer | 0.311 | 211.78 | 1.997 | 0.399 | 0.334 | 0.89
transposer (ours) | 0.437 | 74.35 | 0.366 | 0.273 | 0.272 | -
Ground Truth | 1 | 0 | 0.112 | 0 | 0.270 | 0.51
Non-stat.* | 0.767 | 73.72 | 2.149 | 0.1695 | 0.3276 | -
SinGAN* | 0.492 | 88.14 | 1.137 | 0.2467 | 0.2939 | -
DeepTexture | 0.289 | 67.89 | 0.289 | 0.336 | 0.298 | 0.46
Table 3. Synthesis scores for different approaches averaged over 200 images. Rows marked with * are the variants trained with direct access to the ground truth image.

4.3.2. Evaluating synthesis of size 256 to 512

We also evaluate 256×256 to 512×512 image synthesis using the same metrics. The quantitative results are given in the supplementary file, and visual comparisons can be found in Figure 6. They confirm that our approach produces superior results. For example, Self-tuning almost completely misses the holes in the 1st input texture image, and pix2pixHD simply enlarges the local contents instead of performing synthesis. In Figure 7, we show 4×4 times larger texture synthesis results from our framework, obtained by running the transposer network twice, with each pass performing 2×2 times larger synthesis.

Setting | SSIM | FID | c-FID | LPIPS | c-LPIPS
Self-sim. Map (default) | 0.437 | 74.35 | 0.366 | 0.273 | 0.272
Learnable TransConv | 0.3087 | 88.05 | 0.387 | 0.331 | 0.2797
Fixed Map | 0.2966 | 97.79 | 0.383 | 0.3554 | 0.2848
Random Map | 0.2959 | 76.51 | 0.387 | 0.336 | 0.2645
Table 4. Ablation study for transposed convolution operation and self-similarity map. For SSIM, the higher the better; for other metrics, the lower the better. The first row represents the transposer framework taking self-similarity map as inputs, the default setting in this paper.
Figure 8. Ablation study for using transposed convolution operations and self-similarity maps. Columns from left to right: Input, Ours, Learn. TransConv, Fixed Map, Random Map. It can be seen that without using them, the results become much worse.
Figure 9. Direct 2048×2048 texture generation from a 128×128 input by sampling random noise maps. Zoom in for more details. Left small image is the input; right large image is the output.
Figure 10. Diverse outputs given different random noises as inputs. Columns from left to right: input, result 1, result 2, result 3.

4.4. Ablation Study and Random Noise as Input

4.4.1. Ablation study

To understand the role of the self-similarity map, we conduct three additional ablation experiments: 1) Learnable TransConv: using a traditional transposed convolution layer with learnable parameters instead of directly using the encoded features as filters and their self-similarity map as input, while keeping the other network parts and training strategies unchanged; 2) Fixed Map: using fixed sampled maps instead of self-similarity maps; 3) Random Map: using randomly sampled maps instead of self-similarity maps. As shown in Figure 2, we have features at 3 different scales; for Fixed Map and Random Map, we sample the map for the smallest-scale feature and then bilinearly upsample it for the other two scales. Table 4 and Figure 8 show the quantitative and qualitative results, respectively. These 3 settings are compared with the default transposer setting, which uses the self-similarity map as the transposed convolution input. Learnable TransConv, with the traditional learnable transposed convolution layer, simply enlarges the input rather than performing reasonable synthesis, similar to pix2pixHD. This confirms our hypothesis that conventional CNN designs with traditional (de)convolution layers and up/down-sampling layers cannot capture the long-term structural dependency required by texture synthesis. Fixed Map cannot produce faithful results. On the other hand, using a random noise map as the transposed convolution input has both advantages and disadvantages, as discussed below.
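The map sampling used for the Fixed Map and Random Map settings can be sketched as follows: draw one map for the smallest scale, then bilinearly upsample it for the two larger scales. This is a minimal pure-Python sketch under stated assumptions: the helper names, the uniform noise distribution, the corner-aligned interpolation and the factor-of-two size ratio between scales are illustrative choices, not details confirmed by the text.

```python
import random


def bilinear_resize(grid, out_h, out_w):
    """Corner-aligned bilinear resize of a 2D list of floats."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional input row coordinate.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def sample_map_pyramid(base_size, num_scales=3, seed=None):
    """Sample a map at the smallest scale and upsample it for larger scales.

    Assumption: each larger scale is twice the size of the previous one.
    """
    rng = random.Random(seed)
    base = [[rng.random() for _ in range(base_size)] for _ in range(base_size)]
    maps = [base]
    for s in range(1, num_scales):
        size = base_size * 2 ** s
        maps.append(bilinear_resize(base, size, size))
    return maps


# "Fixed Map": reuse one seed for every input; "Random Map": draw a new seed each time.
fixed_maps = sample_map_pyramid(4, seed=0)
random_maps = sample_map_pyramid(4)
```

Reusing the same seed for every input is one way to read "fixed sampled maps"; the source does not specify how the fixed map is chosen.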

4.4.2. Trade-off between self-similarity map and random noise map

In the last column of Figure 8, the 1st row shows that sampling a random noise map at test time can successfully generate diverse results. However, the self-similarity map is critical for identifying structural patterns and preserving them in the output. In the 2nd row of Figure 8, the result using self-similarity maps preserves the regular structures, while the result using random noise maps does not. We believe that in practice there is a trade-off between preserving structure and generating variety. For input texture images with regular structural patterns, the self-similarity map provides better guidance for the transposed convolution operation to preserve these patterns. On the other hand, random noise maps can generate diverse outputs by sampling different noise maps, as shown in Figure 10. They also make it possible to directly generate arbitrarily large texture outputs by sampling larger noise maps, as shown in Figure 9, whereas the self-similarity map only supports synthesis of less than 3×3 times the input size, limited by the size of the self-similarity map.
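The size limit mentioned above is consistent with the output-size arithmetic of a stride-1 transposed convolution without padding: an input map of size M and a filter of size k give an output of size M + k - 1. The sketch below works this out at feature level. The function names are illustrative, and the premise that a self-similarity map only covers shifts with a non-empty overlap (at most 2k - 1 positions per axis) is an assumption made here to show why the output stays below 3 times the filter size.

```python
def transposed_conv_output_size(map_size, filter_size, stride=1):
    """Output size of a transposed convolution without padding."""
    return (map_size - 1) * stride + filter_size


def max_self_similarity_output(filter_size):
    """Largest output reachable when the input map is a self-similarity map.

    Assumption: shifts range over -(k - 1) .. (k - 1), so that the shifted
    copy still overlaps the original by at least one position.
    """
    largest_map = 2 * filter_size - 1
    return transposed_conv_output_size(largest_map, filter_size)


k = 8  # hypothetical feature-map (filter) size
print(transposed_conv_output_size(k + 1, k))   # map of size k + 1 doubles the size
print(max_self_similarity_output(k))           # upper limit for a self-similarity map
print(transposed_conv_output_size(100, k))     # a noise map can be arbitrarily large
```

Under these assumptions the largest self-similarity output is 3k - 2, which is strictly smaller than 3k, while a sampled noise map has no such bound.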

5. Conclusion & Discussion

Figure 11. Failure case for our method. From left to right: input, our synthesis result and the ground truth.

In this paper, we present a new deep learning based texture synthesis framework built on transposed convolution operations. In our framework, the transposed convolution filter is the encoded features of the input texture image, and the input to the transposed convolution is the self-similarity map computed on the corresponding encoded features. Quantitative comparisons based on existing metrics, our specifically designed metrics for texture synthesis, and user study results all show that our method significantly outperforms existing methods while also being much faster. The self-similarity map helps preserve structure better, while the random noise map allows generating diverse results. Further research could provide more controllable flexibility by combining the self-similarity map and the random noise map as inputs. One limitation of our method is that it fails to handle sparse thin structures, as shown in Figure 11, and highly non-stationary inputs (Zhou et al., 2018). As some highly non-stationary textures mainly emphasize the effect in some specific direction, one possible solution may be to emphasize the similarity score in specific directions while suppressing it in other directions to capture directional effects, and/or to use cropped, resized or rotated feature maps as transposed convolution filters to capture the effects of textons repeating in various forms. We leave these for future research. While existing deep learning-based image synthesis methods mostly focus on taking inputs from other modalities such as semantic maps or edge maps, we believe our method will also stimulate more deep learning research on exemplar-based synthesis.

Acknowledgements.
We would like to thank Brandon Rowllet, Sifei Liu, Aysegul Dundar, Kevin Shih, Rafael Valle and Robert Pottorff for valuable discussions and proof-reading.

References

  • S. Abdelmounaime and H. Dong-Chen (2013) New brodatz-based image databases for grayscale color and multiband texture analysis. Cited by: §4.1.
  • A. Alanov, M. Kochurov, D. Volkhonskiy, D. Yashkov, E. Burnaev, and D. Vetrov (2019) User-controllable multi-texture synthesis with generative adversarial networks. arXiv preprint arXiv:1904.04751. Cited by: §2, §2.
  • U. Bergmann, N. Jetchev, and R. Vollgraf (2017) Learning texture manifolds with the periodic spatial gan. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 469–477. Cited by: §2, §2.
  • G. J. Burghouts and J. Geusebroek (2009) Material-specific adaptation of color invariant features. Pattern Recognition Letters 30 (3), pp. 306 – 313. External Links: ISSN 0167-8655, Document, Link Cited by: §4.1.
  • Center for Machine Vision Research, University of Oulu, Finland. Outex texture database. External Links: Link Cited by: §4.1.
  • M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
  • D. Dai, H. Riemenschneider, and L. Van Gool (2014) The synthesizability of texture examples. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
  • S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen (2012) Image melding: combining inconsistent images using patch-based synthesis. ACM Transactions on Graphics (ToG) 31 (4), pp. 1–10. Cited by: §2.
  • V. Dumoulin and F. Visin (2016) A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. Cited by: §1.
  • A. Dundar, K. Sapra, G. Liu, A. Tao, and B. Catanzaro (2020) Panoptic-based image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8070–8079. Cited by: §2.
  • A. A. Efros and W. T. Freeman (2001) Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 341–346. Cited by: §1, §2.
  • A. A. Efros and T. K. Leung (1999) Texture synthesis by non-parametric sampling. In Proceedings of the seventh IEEE international conference on computer vision, Vol. 2, pp. 1033–1038. Cited by: §1, §2.
  • M. Fritz, E. Hayman, B. Caputo, and J. Eklundh (2004) THE kth-tips database. Cited by: §4.1.
  • A. Frühstück, I. Alhashim, and P. Wonka (2019) TileGAN: synthesis of large-scale non-homogeneous textures. ACM Transactions on Graphics (Proc. SIGGRAPH) 38 (4), pp. 58:1–58:11. Cited by: §2, §2.
  • L. A. Gatys, A. S. Ecker, and M. Bethge (2015a) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576. Cited by: §3.2.
  • L. Gatys, A. S. Ecker, and M. Bethge (2015b) Texture synthesis using convolutional neural networks. In Advances in neural information processing systems, pp. 262–270. Cited by: Figure 1, §1, §2, §2, §4.2, Table 1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin (2001) Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 327–340. Cited by: §2.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems. Cited by: item 1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.1.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004. External Links: Link, 1611.07004 Cited by: §2.
  • N. Jetchev, U. Bergmann, and R. Vollgraf (2016) Texture synthesis with spatial generative adversarial networks. CoRR abs/1611.08207. External Links: Link, 1611.08207 Cited by: §2, §2.
  • J. Johnson, A. Alahi, and F. Li (2016) Perceptual losses for real-time style transfer and super-resolution. CoRR abs/1603.08155. External Links: Link, 1603.08155 Cited by: §2.
  • A. Kaspar, B. Neubert, D. Lischinski, M. Pauly, and J. Kopf (2015) Self tuning texture optimization. In Computer Graphics Forum, Vol. 34, pp. 349–359. Cited by: Figure 1, §1, §2, §2, §2, §4.2, Table 1.
  • V. Kwatra, I. Essa, A. Bobick, and N. Kwatra (2005) Texture optimization for example-based synthesis. In ACM SIGGRAPH 2005 Papers, SIGGRAPH ’05, New York, NY, USA, pp. 795–802. External Links: Link, Document Cited by: §1, §2, §2.
  • V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick (2003) Graphcut textures: image and video synthesis using graph cuts. In ACM Transactions on Graphics (ToG), Vol. 22, pp. 277–286. Cited by: §1, §2, §2.
  • C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2016) Photo-realistic single image super-resolution using a generative adversarial network. CoRR abs/1609.04802. External Links: Link, 1609.04802 Cited by: §2.
  • S. Lefebvre and H. Hoppe (2006) Appearance-space texture synthesis. ACM Transactions on Graphics (TOG) 25 (3), pp. 541–548. Cited by: §2.
  • C. Li and M. Wand (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. CoRR abs/1604.04382. External Links: Link, 1604.04382 Cited by: §2, §2.
  • Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2017a) Diversified texture synthesis with feed-forward networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3920–3928. Cited by: §1, §1, §2, §2.
  • Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2017b) Universal style transfer via feature transforms. CoRR abs/1705.08086. External Links: Link, 1705.08086 Cited by: Figure 1, §1, §2, §4.2, Table 1.
  • L. Liang, C. Liu, Y. Xu, B. Guo, and H. Shum (2001) Real-time texture synthesis by patch-based sampling. ACM Transactions on Graphics (ToG) 20 (3), pp. 127–150. Cited by: §2.
  • G. Liu, Y. Gousseau, and G. Xia (2016) Texture synthesis through convolutional neural networks and spectrum constraints. CoRR abs/1605.01141. External Links: Link, 1605.01141 Cited by: §2.
  • G. Liu, K. J. Shih, T. Wang, F. A. Reda, K. Sapra, Z. Yu, A. Tao, and B. Catanzaro (2018) Partial convolution based padding. arXiv preprint arXiv:1811.11718. Cited by: §A.4.
  • P. Mallikarjuna, A. Targhi, M. Fritz, E. Hayman, B. Caputo, and J. Eklundh (2006) THE kth-tips2 database. pp. . Cited by: §4.1.
  • R. Picard, C. Graczyk, S. Mann, J. Wachman, L. Picard, and L. Campbell (2010) VisTex vision texture database. pp. . Cited by: §4.1.
  • J. Portilla and E. P. Simoncelli (2000) A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision 40 (1), pp. 49–70. External Links: ISSN 1573-1405, Document, Link Cited by: §2, §2.
  • Y. Pritch, E. Kav-Venaki, and S. Peleg (2009) Shift-map image editing. In 2009 IEEE 12th International Conference on Computer Vision, pp. 151–158. Cited by: §2.
  • A. Rosenberger, D. Cohen-Or, and D. Lischinski (2009) Layered shape synthesis: automatic generation of control maps for non-stationary textures. In ACM SIGGRAPH Asia 2009 Papers, SIGGRAPH Asia ’09, New York, NY, USA, pp. 107:1–107:9. External Links: ISBN 978-1-60558-858-2, Link, Document Cited by: §2, §2.
  • O. Sendik and D. Cohen-Or (2017) Deep correlations for texture synthesis. ACM Transactions on Graphics (TOG) 36 (5), pp. 161. Cited by: §2, §2.
  • T. R. Shaham, T. Dekel, and T. Michaeli (2019) Singan: learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4570–4580. Cited by: Figure 1, §1, §1, §2, §2, §4.2, Table 1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2.
  • D. Ulyanov, V. Lebedev, V. Lempitsky, et al. (2016) Texture networks: feed-forward synthesis of textures and stylized images. In International Conference on Machine Learning, pp. 1349–1357. Cited by: §2.
  • T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §A.4, Figure 1, §1, §2, §4.2, Table 1.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: item 1.
  • L. Wei, S. Lefebvre, V. Kwatra, and G. Turk (2009) State of the art in example-based texture synthesis. In Eurographics ’09 State of the Art Reports (STARs), External Links: Link Cited by: §2.
  • L. Wei and M. Levoy (2000) Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 479–488. Cited by: §2.
  • Q. Wu and Y. Yu (2004) Feature matching and deformation for texture synthesis. ACM Transactions on Graphics (TOG) 23 (3), pp. 364–367. Cited by: §2.
  • R. Wu, W. Wang, and Y. Yu (2013) Optimized synthesis of art patterns and layered textures. IEEE transactions on visualization and computer graphics 20 (3), pp. 436–446. Cited by: §2.
  • N. Yu, C. Barnes, E. Shechtman, S. Amirghodsi, and M. Lukac (2019) Texture mixer: a network for controllable synthesis and interpolation of texture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12164–12173. Cited by: Figure 1, §2, §4.2, §4.2, Table 1.
  • J. Zhang, K. Zhou, L. Velho, B. Guo, and H. Shum (2003) Synthesis of progressively-variant textures on arbitrary surfaces. ACM Transactions on Graphics (TOG) 22 (3), pp. 295–302. Cited by: §2.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: item 1.
  • Y. Zhou, Z. Zhu, X. Bai, D. Lischinski, D. Cohen-Or, and H. Huang (2018) Non-stationary texture synthesis by adversarial expansion. ACM Transactions on Graphics (Proc. SIGGRAPH) 37 (4), pp. 49:1–49:13. Cited by: Figure 1, §1, §1, §2, §2, §4.2, Table 1, §5.

Appendix A Framework Detail

A.1. Self-Similarity Computing & Transposed Convolution

The reviewers are welcome to check the attached animation video showing how a self-similarity map is computed and how the transposed convolution operation is performed.

A.2. Implementation Details for Computing Self-similarity Map

The self-similarity map can be computed efficiently with standard convolution operations. The formula for computing the self-similarity map can be relaxed as follows:

(5)

Here, and indicate the overlapping region between current -shifted copy and the original copy. is the L2 norm of . The denominator is used for denormalization such that the scale of is independent of the scale of .

Implementation Details. can be computed by using as convolution input and a convolution filter with weights being all 1s and biases being all 0s. can be computed by using the zero-padded , with zero padding on top and bottom sides and zero padding on left and right sides as convolution input and as convolution filter. Similarly, can be computed by using a map, with the center region being 1 and other region being 0, as convolution input and as convolution filter.
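The idea of comparing a shifted copy of a feature map with the original over their overlapping region can be illustrated with the sketch below. It is a stand-in under explicit assumptions, not the formula of Eq. 5: it scores each shift by a normalized correlation over the overlap, works on a single-channel 2D list, and uses direct loops instead of convolutions. The cross term and the overlap norms accumulated here are the kind of per-shift sums that the zero-padded convolutions described above would produce for all shifts at once. The function name and the `max_shift` parameter are hypothetical.

```python
import math


def self_similarity_map(feat, max_shift):
    """Normalized correlation between `feat` and its shifted copies.

    Entry [p + max_shift][q + max_shift] scores the (p, q)-shifted copy
    against the original, using only the region where both overlap.
    Assumption: this score is an illustrative choice, not the paper's Eq. 5.
    """
    h, w = len(feat), len(feat[0])
    sim = []
    for p in range(-max_shift, max_shift + 1):
        row = []
        for q in range(-max_shift, max_shift + 1):
            dot = norm_a = norm_b = 0.0
            # Positions (m, n) where both the original and the shifted copy exist.
            for m in range(max(0, p), min(h, h + p)):
                for n in range(max(0, q), min(w, w + q)):
                    a = feat[m][n]          # original copy
                    b = feat[m - p][n - q]  # (p, q)-shifted copy
                    dot += a * b
                    norm_a += a * a
                    norm_b += b * b
            denom = math.sqrt(norm_a * norm_b)
            row.append(dot / denom if denom > 0 else 0.0)
        sim.append(row)
    return sim


# Toy striped texture with a horizontal period of 2.
stripes = [[1.0, 0.0, 1.0, 0.0] for _ in range(4)]
sim = self_similarity_map(stripes, max_shift=2)
```

For the striped toy input, shifts by a full period score high and shifts by half a period score low, which is the kind of structural cue the self-similarity map is meant to carry.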

A.3. Transposed Convolution Block

Table 5 lists the main differences between a typical transposed convolution operation and our transposed convolution operation.

Figure 12 shows the details of the transposed convolution block in our framework.

Property | typical transposed conv operation | our transposed conv operation
input | output from previous layer | self-similarity map from encoded features
filter | learn-able parameters | feature maps from encoder
bias term | learn-able parameters | avg-pooling of encoded features with linear transform
filter size | small (e.g. 4x4, 3x3) | large (e.g. 8x8, 16x16, 32x32, 64x64)
stride | 2 (for upsampling purpose) | 1
Table 5. Main differences between typical transposed convolution and our transposed convolution operation
Figure 12. The details of a transposed convolution block. The left part shows the corresponding preview which is used in the Figure 3 in the main paper; the right part shows the details of this transposed convolution block.
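To make the contrast in Table 5 concrete, here is a minimal single-channel sketch of a stride-1 transposed convolution in which the filter is a feature map and the input is a similarity (or noise) map. The function name and the toy arrays are illustrative assumptions; the actual block works on multi-channel encoder features and adds a bias term and further convolutions, all of which are omitted here.

```python
def transposed_conv_stride1(sim_map, feat):
    """Stride-1 transposed convolution without padding.

    Every entry of `sim_map` pastes a copy of `feat`, scaled by that entry,
    at the corresponding offset; overlapping copies are summed.
    """
    mh, mw = len(sim_map), len(sim_map[0])
    fh, fw = len(feat), len(feat[0])
    out = [[0.0] * (mw + fw - 1) for _ in range(mh + fh - 1)]
    for i in range(mh):
        for j in range(mw):
            weight = sim_map[i][j]
            if weight == 0:
                continue  # zero entries contribute nothing
            for m in range(fh):
                for n in range(fw):
                    out[i + m][j + n] += weight * feat[m][n]
    return out


feat = [[1.0, 2.0], [3.0, 4.0]]           # toy "encoded feature" used as the filter
one_peak = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
pasted = transposed_conv_stride1(one_peak, feat)

two_peaks = [[1, 0, 1]]                   # peaks one filter-width apart
tiled = transposed_conv_stride1(two_peaks, [[1.0, 2.0]])
```

A single peak reproduces the feature map at the peak's offset, and peaks spaced one filter width apart tile it, which mirrors the assembling/stitching view of texture synthesis.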

A.4. Network Details

Table 7 shows the details of the generator. The discriminator network is the same as that of pix2pixHD (Wang et al., 2018). We use partial convolution based padding (Liu et al., 2018) instead of zero padding for all the convolution layers.

Method | SSIM | FID | c-FID | LPIPS | c-LPIPS
Self-tuning | 0.3157 | 95.829 | 0.4393 | 0.4078 | 0.3653
Non-station. | 0.3349 | 120.245 | 1.6888 | 0.4226 | 0.3911
sinGAN | 0.3270 | 147.9333 | 1.3806 | 0.4230 | 0.3829
pix2pixHD | 0.3253 | 131.655 | 0.5472 | 0.4193 | 0.3780
Ours | 0.4533 | 78.4808 | 0.3973 | 0.3246 | 0.3563
Non-station.* | 0.4915 | 211.0645 | 1.4274 | 0.3411 | 0.3893
sinGAN* | 0.2913 | 154.651 | 1.6909 | 0.4787 | 0.4364
DeepTexture | 0.3011 | 82.053 | 0.5649 | 0.4175 | 0.3830
WCT | 0.3124 | 144.208 | 0.4125 | 0.4427 | 0.4068
Table 6. 256 to 512 synthesis scores for different approaches averaged over 200 images. Non-station.*, sinGAN*, DeepTexture and WCT directly take the ground truth images as inputs.
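Block | Layer | Filter Size | # Filters | Stride/Up Factor | Sync BN | ReLU
Encoder | Conv1 | 3×3 | 3 → 64 | 1 | Y | Y
Encoder | Conv2_1 | 3×3 | 64 → 128 | 2 | Y | Y
Encoder | Conv2_2 | 3×3 | 128 → 128 | 1 | Y | Y
Encoder | Conv3_1 | 3×3 | 128 → 256 | 2 | Y | Y
Encoder | Conv3_2 | 3×3 | 256 → 256 | 1 | Y | Y
Encoder | Conv4_1 | 3×3 | 256 → 512 | 2 | Y | Y
Encoder | Conv4_2 | 3×3 | 512 → 512 | 1 | Y | Y
Encoder | Conv5_1 | 3×3 | 512 → 1024 | 2 | Y | Y
Encoder | Conv5_2 | 3×3 | 1024 → 1024 | 1 | Y | Y
TransConv_Block3 (w/ Conv3_2) | FilterBranch_Conv1 | 3×3 | 256 → 256 | 1 | - | Y
TransConv_Block3 (w/ Conv3_2) | FilterBranch_Conv2 | 3×3 | 256 → 256 | 1 | - | -
TransConv_Block3 (w/ Conv3_2) | FilterBranch_FC1 | - | 256 → 256 | 1 | - | -
TransConv_Block3 (w/ Conv3_2) | SelfSimilarityMapBranch_Conv1 | 3×3 | 1 → 8 | 1 | - | Y
TransConv_Block3 (w/ Conv3_2) | SelfSimilarityMapBranch_Conv2 | 3×3 | 8 → 1 | 1 | - | -
TransConv_Block3 (w/ Conv3_2) | Transposed Convolution Operation | filter: 256, input: 1 | 256 | - | - | -
TransConv_Block3 (w/ Conv3_2) | OutputBranch_Conv | 3×3 | 256 → 256 | 1 | - | Y
TransConv_Block4 (w/ Conv4_2) | FilterBranch_Conv1 | 3×3 | 512 → 512 | 1 | - | Y
TransConv_Block4 (w/ Conv4_2) | FilterBranch_Conv2 | 3×3 | 512 → 512 | 1 | - | -
TransConv_Block4 (w/ Conv4_2) | FilterBranch_FC1 | - | 512 → 512 | 1 | - | -
TransConv_Block4 (w/ Conv4_2) | SelfSimilarityMapBranch_Conv1 | 3×3 | 1 → 8 | 1 | - | Y
TransConv_Block4 (w/ Conv4_2) | SelfSimilarityMapBranch_Conv2 | 3×3 | 8 → 1 | 1 | - | -
TransConv_Block4 (w/ Conv4_2) | Transposed Convolution Operation | filter: 512, input: 1 | 512 | - | - | -
TransConv_Block4 (w/ Conv4_2) | OutputBranch_Conv | 3×3 | 512 → 512 | 1 | - | Y
TransConv_Block5 (w/ Conv5_2) | FilterBranch_Conv1 | 3×3 | 1024 → 1024 | 1 | - | Y
TransConv_Block5 (w/ Conv5_2) | FilterBranch_Conv2 | 3×3 | 1024 → 1024 | 1 | - | -
TransConv_Block5 (w/ Conv5_2) | FilterBranch_FC1 | - | 1024 → 1024 | 1 | - | -
TransConv_Block5 (w/ Conv5_2) | SelfSimilarityMapBranch_Conv1 | 3×3 | 1 → 8 | 1 | - | Y
TransConv_Block5 (w/ Conv5_2) | SelfSimilarityMapBranch_Conv2 | 3×3 | 8 → 1 | 1 | - | -
TransConv_Block5 (w/ Conv5_2) | Transposed Convolution Operation | filter: 1024, input: 1 | 1024 | - | - | -
TransConv_Block5 (w/ Conv5_2) | OutputBranch_Conv | 3×3 | 1024 → 1024 | 1 | - | Y
Decoder | BilinearUpSample1 (w/ TransConv_Block5 output) | - | - | 2 | - | -
Decoder | Conv6 | 3×3 | 1024 → 512 | 1 | Y | Y
Decoder | Sum (Conv6 + TransConv_Block4 output) | - | - | - | - | -
Decoder | BilinearUpSample2 | - | - | 2 | - | -
Decoder | Conv7 | 3×3 | 512 → 256 | 1 | Y | Y
Decoder | Sum (Conv7 + TransConv_Block3 output) | - | - | - | - | -
Decoder | BilinearUpSample3 | - | - | 2 | - | -
Decoder | Conv8 | 3×3 | 256 → 128 | 1 | Y | Y
Decoder | BilinearUpSample4 | - | - | 2 | - | -
Decoder | Conv9 | 3×3 | 128 → 64 | 1 | Y | Y
Decoder | Conv10 | 3×3 | 64 → 3 | 1 | - | -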
Table 7. The details of network parameters. TransConv_Block3-5 represent the three transposed convolution blocks in our framework (The diagrams can be found in Figure 2 in the main paper). SyncBatchNorm column indicates Synchronized Batch Normalization layer after Conv. ReLU column shows whether ReLU is used (following the SyncBatchNorm if SyncBatchNorm is used). BilinearUpSample represents bilinear upsampling. Sum denotes the simple summation. orig_H and orig_W are input image’s height and width.
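The spatial sizes implied by the strides and upsampling factors in Table 7 can be traced with the short sketch below. It assumes that the 3×3 convolutions preserve spatial size, that each stride-2 convolution halves it, and that each transposed convolution block outputs a map twice as large as the encoder feature it takes; the last point is an assumption inferred from the 2× synthesis task and from the skip sums in the decoder. The function name and dictionary keys are illustrative.

```python
def trace_generator_sizes(orig_size):
    """Trace the spatial size (one axis) through the generator of Table 7."""
    if orig_size % 16 != 0:
        raise ValueError("orig_size must be divisible by 16 (four stride-2 convolutions)")
    sizes = {"Conv1": orig_size}
    size = orig_size
    for name in ("Conv2_2", "Conv3_2", "Conv4_2", "Conv5_2"):
        size //= 2  # the preceding stride-2 convolution halves the size
        sizes[name] = size
    # Assumption: each block doubles the size of the encoder feature it uses.
    for k in (3, 4, 5):
        sizes[f"TransConv_Block{k}"] = 2 * sizes[f"Conv{k}_2"]
    x = sizes["TransConv_Block5"]
    skips = {1: "TransConv_Block4", 2: "TransConv_Block3"}
    for step in (1, 2, 3, 4):
        x *= 2  # bilinear upsampling by a factor of 2
        skip = skips.get(step)
        if skip is not None and x != sizes[skip]:
            raise ValueError(f"size mismatch at the sum with {skip}")
        sizes[f"BilinearUpSample{step}"] = x
    sizes["output"] = x
    return sizes


for name, value in trace_generator_sizes(128).items():
    print(name, value)
```

With a 128 input, the encoder features used as filters have sizes 32, 16 and 8 under these assumptions (64, 32 and 16 for a 256 input), in line with the large filter sizes listed in Table 5, and the decoder output is twice the input size.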

Appendix B Additional Comparison

B.1. 256 to 512 Synthesis

In Table 6, we provide the quantitative comparisons for the synthesis results of 256 to 512.

B.2. 128 to 256 Synthesis

Non-stat. and Non-stat.* baselines: we take the original code from the author's GitHub repository. The original training strategy for each training iteration is: 1) randomly crop a from the original big image() as the target image; 2) from the target image, randomly crop a image as the input image. Thus, for 128 to 256 synthesis, to train Non-stat. (without seeing the ground truth image), for each training iteration we randomly crop a image from the input image as the target image and then, from the target image, randomly crop a image as the input. To train Non-stat.* (directly seeing the ground truth image), for each training iteration we randomly crop a image from the ground truth image as the target image and then, from the target image, randomly crop a image as the input image. For both Non-stat. and Non-stat.*, the inference stage takes image as input.

sinGAN and sinGAN* baselines: for training sinGAN, we used the original author's implementation available on GitHub, with the default settings provided in the source code. The sinGAN code can synthesize textures in two different modes: one that generates a random variation of the same size as the input texture (we directly use the ground truth for training, denoted as sinGAN*), and another that generates a texture of larger size (only using image, denoted as sinGAN).

Figure 13. Results of different approaches on 128 to 256 texture synthesis. Columns from left to right: input, ours, Self-tuning, pix2pixHD, WCT, sinGAN*, Non-stat.*, DeepTexture, ground truth. sinGAN* and Non-stat.* show the results of training with directly seeing the ground truth at the target size. (Training with ground truth means using the ground truth 256×256 image as the target for each training iteration.) WCT is the style transfer based method. DeepTexture directly takes ground truth images as inputs.
Figure 14. Results of different approaches on 128 to 256 texture synthesis. Columns from left to right: input, ours, Self-tuning, pix2pixHD, WCT, SinGAN, Non-stat., DeepTexture, ground truth. The SinGAN and Non-stat. results are from training without directly seeing the ground truth at the exact target size.