Texture synthesis is defined as the problem of generating a large image output given a small example input such that the visual features and structures are preserved both locally and globally. Many methods have been explored in the past two decades including pixel-based methods (Efros and Leung, 1999), assembling based methods (Efros and Freeman, 2001; Kwatra et al., 2003), optimization based methods (Kwatra et al., 2005; Kaspar et al., 2015), etc.
Inspired by the unprecedented success of deep learning in computer vision, others have explored deep learning methods for texture synthesis. Existing works fall into one of two categories. Either an optimization procedure is used to match deep feature statistics in a pre-trained network (Gatys et al., 2015b; Li et al., 2017b), resulting in a slow generation process; or a network is trained to overfit on a fixed image or set of images (Li et al., 2017a; Zhou et al., 2018; Shaham et al., 2019), which prevents it from generalizing to unseen textures and requires substantial re-training time for every unseen texture image.
This is because these one-model-per-image (set) approaches usually employ conventional image-to-image translation networks, which first embed the input into a feature space and then rely fully on a sequence of upsampling and convolutional layers to reach the target output size. Each upsampling and convolutional layer is a local operation lacking global awareness. This design works well for tasks such as image super-resolution, where the goal is to enhance or modify local details. However, texture synthesis differs from super-resolution in that, viewed from a classical perspective, it involves displacing and assembling copies of the input texture at different optimal offsets in a seamless way. The optimal displacement and assembling strategy involves much longer-range operations and compatibility checks, which are not easy to model with the conventional design that relies fully on a sequence of local up/down-sampling and (de)convolutional layers.
In the pix2pixHD column of Figure 1, we show that a conventional image-to-image translation network adapted from pix2pixHD (Wang et al., 2018) fails to perform reasonable texture synthesis and instead mostly just enlarges the local contents of the input textures, even though it has been trained to convergence using the same input and output pairs as our method.
In this paper, we propose a new deep learning based texture synthesis framework that generalizes to arbitrary unseen textures and synthesizes larger-size outputs. From a classical view, the texture synthesis task can be interpreted as first finding an appropriate offset at which to place a copy of the input texture image, and then using optimization techniques to find the optimal seam between this newly placed copy and the existing image so as to assemble them together. Our method follows a similar spirit but differs in several major ways: 1) We perform assembling in feature space and at multiple scales. 2) The optimal shifting offsets and assembling weights are modeled with the help of a score map, which captures the similarity and correlation between different regions of the encoded texture image; we call this score map a self-similarity map (discussed in detail in Section 3). 3) We later show that the shifting and assembling operations can be efficiently performed with a single forward pass of a transposed convolution operation (Dumoulin and Visin, 2016), where we directly use the encoded features of the input texture as transposed convolution filters, and the self-similarity map as the transposed convolution input. Unlike in traditional transposed convolutions, our transposed convolution filters are not learnable parameters. While the self-similarity map plays a key role in preserving regular structural patterns, our framework alternatively allows taking a random noise map as input instead of the self-similarity map; for texture inputs without regular structure, this generates diverse outputs, and arbitrarily large ones in a single shot, by sampling a correspondingly large random noise map.
In this work, we make the following contributions: 1) We present a generalizable texture synthesis framework that performs faithful synthesis on unseen texture images in nearly real time with a single forward pass. 2) We propose a self-similarity map that captures the similarity and correlation information between different regions of a given texture image. 3) We show that the shifting and assembling operations in traditional texture synthesis methods can be efficiently implemented using a transposed convolution operation. 4) We achieve state-of-the-art texture synthesis quality as measured by existing image metrics, metrics designed specifically for texture synthesis, and a user study. 5) We show that our framework is also able to generate diverse and arbitrarily large texture synthesis results by sampling random noise maps.
2. Related Work
We provide a brief overview of existing texture synthesis methods; a complete survey (Wei et al., 2009) is beyond the scope of this work.
Non-parametric Texture Synthesis. Existing texture synthesis methods include pixel-based methods (Efros and Leung, 1999; Wei and Levoy, 2000), assembling based methods (Efros and Freeman, 2001; Liang et al., 2001; Kwatra et al., 2003; Pritch et al., ), optimization based methods (Portilla and Simoncelli, 2000; Kwatra et al., 2005; Rosenberger et al., 2009; Kaspar et al., 2015), appearance space synthesis (Lefebvre and Hoppe, 2006), etc. There are also other works (Hertzmann et al., 2001; Zhang et al., 2003; Wu and Yu, 2004; Lefebvre and Hoppe, 2006; Rosenberger et al., 2009; Wu et al., 2013) showing interesting synthesis results; however, they usually need additional manual user input.
Among these traditional methods, self-tuning texture optimization (Kaspar et al., 2015) is the current state of the art. It uses image melding (Darabi et al., 2012) with automatically generated and weighted guidance channels, which helps to reconstruct the middle-scale structures in the input texture. Our method is motivated by assembling based methods. (Kwatra et al., 2003) is a representative method of this kind, where texture synthesis is formulated as a graph cut problem. The optimal offset for displacing the input patch and the optimal cut between the patches can be found by solving the graph cut objective function, which can sometimes be slow.
Optimization-based Texture Synthesis. Traditional optimization-based methods rely on matching the global statistics of hand-crafted features defined on the input and output textures. Recently, deep neural network based methods have been proposed as a way to use features learned from natural image priors to guide the optimization procedure. Gatys et al. (Gatys et al., 2015b) define the optimization procedure as minimizing the difference in Gram matrices of the deep features between the input and output texture images. Sendik et al. (Sendik and Cohen-Or, 2017) and Liu et al. (Liu et al., 2016) modify the loss proposed in (Gatys et al., 2015b) by adding a structural energy term and a spectrum constraint, respectively, to generate structured and regular textures. However, in all cases, these optimization-based methods are prohibitively slow due to the iterative optimization.
Learning-based Texture Synthesis. Johnson et al. (Johnson et al., 2016) and Ulyanov et al. (Ulyanov et al., 2016) alleviate the previously mentioned optimization problem by training a neural network to directly generate the output, using the same loss as in (Gatys et al., 2015b). This setup moves the computational burden to training time, resulting in faster inference. However, the learned network can only synthesize the texture it was trained on and cannot generalize to new textures.
A more recent line of work (Zhou et al., 2018; Shaham et al., 2019; Li and Wand, 2016; Li et al., 2017a; Frühstück et al., 2019; Jetchev et al., 2016; Bergmann et al., 2017; Alanov et al., 2019) has proposed using Generative Adversarial Networks (GANs) for more realistic texture synthesis while still suffering from the inability to generalize to new unseen textures.
Zhou et al. (Zhou et al., 2018) learn a generator network that expands texture blocks into output through a combination of adversarial, L1, and style (Gram matrix) losses. Li et al. and Shaham et al. (Li and Wand, 2016; Shaham et al., 2019) use a special discriminator that examines statistics of local patches in feature space. However, even these approaches can only synthesize the single texture they were trained on.
Other efforts (Li et al., 2017a; Jetchev et al., 2016; Bergmann et al., 2017; Alanov et al., 2019; Frühstück et al., 2019) train on a set of texture images. At test time, the texture being generated is either chosen by the network (Jetchev et al., 2016; Bergmann et al., 2017) or user-controlled (Li et al., 2017a; Alanov et al., 2019). (Frühstück et al., 2019) propose a non-parametric method to synthesize large-scale, varied outputs by combining intermediate feature maps. However, these approaches limit generation to textures available in the training set, and thus are unable to produce textures outside of it.
Li et al. (Li et al., 2017b) apply a whitening and coloring transform in an encoder-decoder architecture, allowing them to generalize to unseen textures, but they rely on an inner SVD decomposition, which is slow. Additionally, their method can only output texture images of the same size as the input.
Yu et al. (Yu et al., 2019) perform interpolation between two or more source textures. While forcing the two source textures to be identical can convert it to the texture synthesis setting, doing so reduces the framework to something closer to a conventional CNN. Besides likely suffering from the issues of conventional CNNs, its main operations of per-feature-entry shuffling, retiling, and blending would greatly break the regular or large structural patterns in the input.
Other Image-to-Image Tasks. GANs (Goodfellow et al., 2014) have also been used in other image-to-image tasks (Isola et al., 2016; Wang et al., 2018; Dundar et al., 2020). Ledig et al. (Ledig et al., 2016) use them to tackle super-resolution, where detail is added to a low-resolution image to produce a high-definition output. In contrast to these tasks, the texture synthesis problem is to synthesize new, varied regions similar to the input, not to provide more detail to an existing layout as in (Ledig et al., 2016) or to translate the texture to a related domain as in (Isola et al., 2016). Other recipes, such as converting texture synthesis to an image inpainting problem, usually cannot produce satisfying results, as they cannot handle the large holes where the synthesis is needed.
Similarity Map. Our framework relies on computing the self-similarity map, which is similar in spirit to the deep correlation formulation in (Sendik and Cohen-Or, 2017). The difference is that (Sendik and Cohen-Or, 2017) computes a dot product map between the feature map and its spatially shifted version and uses it as a regularizing term in their optimization objective; in contrast, we aggregate all the channels’ information to compute a single-channel difference similarity map and use it to model the optimal synthesis scheme in the network with a single pass.
3. Our Approach
Problem Definition: Given an input texture patch, we want to expand the input texture to a larger output whose local pattern resembles the input texture pattern. Our approach shares a similar spirit with the traditional assembling based methods, which try to find the optimal displacements of copies of the input texture, as well as the corresponding assembly scheme, to produce a large, realistic texture image output. We will first formulate the texture expansion problem as a weighted linear combination of displaced deep features at various shifting positions, and then discuss how to use the transposed convolution operation to compute it.
Deep Feature Expansion: Let F be the deep features of an input texture patch, with C, H, and W being the number of channels, the height, and the width, respectively. We create a spatially expanded feature map, for instance by a factor of 2, by simply pasting and accumulating copies of F into a C x 2H x 2W space. This is done by shifting F along the width axis with a progressive step ranging from 0 to W, as well as along the height axis with a step ranging from 0 to H. All the shifted maps are then aggregated together to give us an expanded feature map F'.
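The shift-and-accumulate expansion above can be sketched in a few lines of numpy. This is an illustration only: it uses uniform weights for every shifted copy, whereas the method weights each copy by a self-similarity score, and the symbols C, H, W follow the text.

```python
import numpy as np

def expand_by_shifting(F):
    """Paste copies of F (C, H, W) at every shift (i, j), 0 <= i <= H and
    0 <= j <= W, accumulating them into a (C, 2H, 2W) grid.

    Uniform weights are used here purely for illustration; the paper
    weights each shifted copy by a self-similarity score instead.
    """
    C, H, W = F.shape
    out = np.zeros((C, 2 * H, 2 * W))
    for i in range(H + 1):      # shift along the height axis
        for j in range(W + 1):  # shift along the width axis
            out[:, i:i + H, j:j + W] += F
    return out
```

Note that the corners of the expanded grid are each covered by exactly one shifted copy, while interior locations accumulate contributions from many overlapping copies.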
To aggregate the shifted copies of F, we compute a weighted sum of them. For instance, to calculate the feature F'(x, y), we aggregate all possible shifted copies of F that cover the spatial location (x, y). While previous approaches rely on hand-crafted rules or other heuristics to aggregate the overlapping features, in our approach, we propose to weight each shifted feature map with a similarity score that quantifies the semantic distance between the original F and its shifted copy. Finally, aggregation is done by simple summation of the weighted features. Mathematically, F' can be given by

F'(x, y) = sum_{i, j} S(i, j) * F^(i,j)(x, y),    (1)

where 0 <= i <= H, 0 <= j <= W, 0 <= x < 2H, 0 <= y < 2W, S(i, j) is the similarity score of the (i, j)-shifting, and F^(i,j) is the projection of F's (i, j)-shifted copy on the 2H x 2W grid. Namely, F^(i,j)(x, y) = F(x - i, y - j), with F(x - i, y - j) = 0 whenever x - i is outside [0, H) or y - j is outside [0, W).
We compute the similarity score of the current (i, j)-shifting using the overlapping region, based on the following equation:

S(i, j) = -||F_O(i, j) - F^(i,j)_O(i, j)||_2 / ||F_O(i, j)||_2,    (2)

Here, F_O(i, j) and F^(i,j)_O(i, j) indicate the overlapping region between the current (i, j)-shifted copy and the original copy. ||F_O(i, j)||_2 is the L2 norm of F_O(i, j). The denominator is used for normalization such that the scale of S is independent of the scale of F. Figure 2(a) shows how the self-similarity score is computed at a given shifting (i, j). Note that the self-similarity map is not symmetric with respect to its center, as the denominator of Equation 2 is not symmetric with respect to the center. A full animation of computing the self-similarity map can be found in the supplementary video. We apply a simple transformation on S before using it in Equation 1, specifically one convolutional layer and one activation layer in our implementation.

As shown in Equation 2, the similarity score for a shift of (i, j) along the height and width axes, respectively, is calculated from the L2 distance between the un-shifted and shifted copies of the feature map, normalized by the norm of the un-shifted copy's overlapping region. So, a shift of (0, 0) gives the maximum score because there is no shifting and the copy exactly matches the original. Computing self-similarity maps can be efficiently implemented with the help of existing convolution operations. Details are discussed in the supplementary file.
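A direct (un-optimized) numpy sketch of the score computation follows. The negative-distance sign convention and the handling of empty overlaps are assumptions made for illustration; the paper additionally passes the raw map through one convolutional layer and one activation before it is used in Equation 1.

```python
import numpy as np

def self_similarity_map(F):
    """Self-similarity scores of a feature map F (C, H, W) for shifts
    (i, j) with 0 <= i <= H, 0 <= j <= W (a sketch of Equation 2).

    The score is the negative L2 distance between the overlapping regions
    of the shifted and un-shifted copies, normalized by the norm of the
    un-shifted copy's overlap. Empty overlaps get -inf (an assumption).
    """
    C, H, W = F.shape
    S = np.full((H + 1, W + 1), -np.inf)
    for i in range(H + 1):
        for j in range(W + 1):
            unshifted = F[:, i:, j:]        # overlap region of the original copy
            shifted = F[:, :H - i, :W - j]  # overlap region of the shifted copy
            if unshifted.size:
                denom = np.linalg.norm(unshifted.ravel())
                diff = np.linalg.norm((unshifted - shifted).ravel())
                S[i, j] = -diff / denom
    return S
```

The zero shift compares the feature map with itself, so its distance is zero and its score is maximal, matching the text above.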
We compute the self-similarity maps at multiple scales. Different texture images may exhibit more obvious self-similarity patterns on a specific scale than other scales, as shown in Figure 3.
Feature (Texture) Expansion via Transposed Convolution Operation: Note that the process of pasting shifted feature maps and aggregating them to create larger feature maps is equivalent to the operation of a standard transposed convolution in deep neural networks. For a given filter and input data, a transposed convolution operation simply copies the filter, weighted by the respective entry's value in the input data, into a larger output grid, and performs a summation. In fact, our proposed Equation 1 is exactly such a transposed convolution. Specifically, we apply transposed convolutions with a stride of 1, treating the feature map F as the transposed convolution filter, and the similarity map S, given by Equation 2, as the input to the transposed convolution. This results in an output feature map of size C x 2H x 2W. Figure 2(b) shows how the transposed convolution is done using the encoded input texture as filters and the first entry in the self-similarity map as input. A full animation of the transposed convolution operation can be found in the supplementary video.
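The equivalence between Equation 1 and a stride-1 transposed convolution can be checked numerically. The sketch below implements both views in numpy: the scatter form (each similarity-map entry pastes a weighted copy of the filter) and the gather form (Equation 1 evaluated per output location); shapes follow our notation and, in practice, a framework routine such as PyTorch's conv_transpose2d would perform the scatter efficiently.

```python
import numpy as np

def expand_transposed_conv(S, F):
    """Stride-1 transposed convolution: filter F (C, H, W), input S (H+1, W+1).
    Each entry S[i, j] scatters a weighted copy of F into a (C, 2H, 2W) grid."""
    C, H, W = F.shape
    out = np.zeros((C, 2 * H, 2 * W))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            out[:, i:i + H, j:j + W] += S[i, j] * F
    return out

def expand_equation1(S, F):
    """The same expansion written as Equation 1:
    F'(x, y) = sum_{i, j} S(i, j) * F(x - i, y - j), zero outside F's support."""
    C, H, W = F.shape
    out = np.zeros((C, 2 * H, 2 * W))
    for x in range(2 * H):
        for y in range(2 * W):
            for i in range(S.shape[0]):
                for j in range(S.shape[1]):
                    if 0 <= x - i < H and 0 <= y - j < W:
                        out[:, x, y] += S[i, j] * F[:, x - i, y - j]
    return out
```

Both functions produce identical outputs for any S and F, which is exactly why the shifting-and-assembling view can be executed as a single transposed convolution pass.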
Figure 2(c) illustrates our overall texture synthesis framework. It relies on a UNet-like architecture. The encoder extracts deep features of the input texture patch at several scales. We then apply our proposed transposed convolution-based feature map expansion technique at each scale. The resulting expanded feature map is then passed on to a standard decoder layer. Our network is fully differentiable, allowing us to train our model end-to-end with stochastic gradient-based optimization. The main components of our framework in Figure 2(c) are:
Encoder: Learns to encode the input texture image into deep features at different scales or levels.
Transposed Convolution Operation: Applies spatially varying transposed convolution operations, treating the encoded feature maps directly as filters and the self-similarity maps as inputs, to produce expanded feature maps, as shown in Figure 2(b). Note that, unlike traditional transposed convolution layers, our transposed convolution filters are not learnable parameters. More details about the difference between our transposed convolution operation and a traditional transposed convolution layer can be found in the supplemental file.
Decoder: Given the already expanded features from the transposed convolution operations at different scales, we follow the traditional decoder network design that uses standard convolutional layers followed by bilinear upsampling layers to aggregate features at different scales, and generate the final output texture, as shown in the last row of Figure 2(c).
As described above, our proposed texture expansion technique is performed at multiple feature representation levels, allowing us to capture both diverse features and their optimal aggregation weights. Unlike previous approaches that rely on heuristics or graph-based techniques to identify the optimal overlap of shifted textures, our approach formulates the problem as the direct generation of larger texture images conditioned on optimally assembled deep features at multiple scales. This makes our approach desirable, as it is data-driven and generalizes across various textures.
3.2. Loss Functions
Figure 4. Ablation study on the components of the loss functions (columns: input, no perceptual loss, no style loss, no GAN loss, full loss).
During training, given a random image of size (2h, 2w), denoted as I_gt, its center crop of size (h, w) will be the input to the network, denoted as I_in. We train the network to predict an output image I_out of size (2h, 2w). A VGG-based perceptual loss, a style loss, and a GAN loss are used to train the network. The perceptual loss and style loss are defined between I_out and I_gt at the full resolution of I_gt; meanwhile, the GAN loss is defined on random crops at the resolution of (h, w). Details are discussed below.
VGG-based perceptual loss and style loss. Perceptual loss and style loss are defined following Gatys et al. (Gatys et al., 2015a).
The perceptual loss and style loss are defined as:

L_perceptual = sum_l (1 / N_l) ||phi_l(I_out) - phi_l(I_gt)||_2^2,    (3)

L_style = sum_l (1 / N_l) ||phi_l(I_out) phi_l(I_out)^T - phi_l(I_gt) phi_l(I_gt)^T||_2^2,    (4)

Here, N_l is the number of entries in phi_l. The perceptual loss computes the distances between both I_out and I_gt, but after projecting these images into higher-level feature spaces using an ImageNet-pretrained VGG-19 (Simonyan and Zisserman, 2014). phi_l(I) is the activation map of the l-th selected layer given original input I. We use features from the 2nd, 7th, 12th, 21st and 30th layers, corresponding to the output of the ReLU layers at each scale. In Equation (4), the matrix operations assume that the high-level feature phi_l is of shape C_l x (H_l W_l), resulting in a C_l x C_l Gram matrix, and N_l is the normalization factor for the l-th selected layer.
GAN loss. The discriminator takes the concatenation of I_in and a random crop of size (h, w) from either I_out or I_gt as input. Denote the random crop from I_out as c_out and the random crop from I_gt as c_gt. The intuition of using concatenation is to let the discriminator learn to classify whether I_in and the crop form a pair of two similar texture patches or not. We randomly crop 10 times for both I_out and I_gt and sum up the losses.
Ablation Study. These 3 losses are summed with weights of 0.05, 120, and 0.2, respectively. We find that all of them are useful and necessary. As shown in Figure 4, without the perceptual loss, the result looks like naive tiling of the input; without the style loss, the border region becomes blurry; and without the GAN loss, there is an obvious discrepancy between the center region and the border region.
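A minimal numpy sketch of the loss combination is given below. It operates on stand-in activations (lists of (C, N) arrays) rather than real VGG-19 features or a discriminator output, and the per-layer normalizations are illustrative assumptions; only the three weights (0.05, 120, 0.2) come from the text.

```python
import numpy as np

def gram(phi):
    """Gram matrix of a (C, N) layer activation, normalized by entry count."""
    C, N = phi.shape
    return phi @ phi.T / (C * N)

def perceptual_loss(phis_out, phis_gt):
    """Mean squared feature distance, summed over the selected layers."""
    return sum(np.mean((a - b) ** 2) for a, b in zip(phis_out, phis_gt))

def style_loss(phis_out, phis_gt):
    """Squared Gram-matrix distance, summed over the selected layers."""
    return sum(np.sum((gram(a) - gram(b)) ** 2) for a, b in zip(phis_out, phis_gt))

def total_loss(l_perc, l_style, l_gan):
    """Weighted sum with the weights reported in the paper (0.05, 120, 0.2)."""
    return 0.05 * l_perc + 120.0 * l_style + 0.2 * l_gan
```

In the real pipeline, phis_out and phis_gt would be the selected VGG-19 ReLU activations of I_out and I_gt, and l_gan would come from the crop-based discriminator.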
4. Experiments and Comparisons
Figures 5 and 6 (columns: input, transposer (ours), self-tuning, pix2pixHD, SinGAN, Non-stat., WCT, DeepText., Texture Mixer, ground truth).
4.1. Dataset & Training
To train our network, we collected a large dataset of texture images. We downloaded 55,583 images from 15 different texture image sources (Cimpoi et al., 2014; Sharan et al., 2014; Dai et al., 2014; Burghouts and Geusebroek, 2009; Center for Machine Vision Research, ; Picard et al., 2010; Abdelmounaime and Dong-Chen, 2013; Fritz et al., 2004; Mallikarjuna et al., 2006). The total dataset consists of texture images with a wide variety of patterns, scales, and resolutions. We randomly split the dataset to create a training set of 49,583 images, a validation set of 1,000 images, and a test set of 5,000 images. All generation and evaluation results in the paper are from the test set. When using these images, we resize them to the target output size to serve as the ground truth and use their center crops as input.
Our network, which utilizes the transposed convolution operation, is implemented using the existing PyTorch interface without custom CUDA kernels. We trained our model on 4 DGX-1 stations with 32 total NVIDIA Tesla V100 GPUs using synchronized batch normalization layers (Ioffe and Szegedy, 2015). For 128x128 to 256x256 synthesis, we use batch size 8 and train for 600 epochs. The learning rate is set to 0.0032 at the beginning and decreased every 150 epochs. For 256x256 to 512x512 synthesis, we fine-tuned the model for 200 epochs based on the one pre-trained for 128x128 to 256x256 synthesis. While directly using the 128-to-256 pre-trained model generates reasonable results, fine-tuning leads to better quality.
4.2. Baseline & Evaluation Metrics
|Self-tuning(Kaspar et al., 2015)||140 s||195 s||Good||Yes|
|Non-stationary(Zhou et al., 2018)||362 mins||380 mins||No||Yes|
|SinGAN(Shaham et al., 2019)||45 mins||100 mins||No||Yes|
|DeepTexture(Gatys et al., 2015b)||13 mins||54 mins||No||No|
|WCT(Li et al., 2017b)||7 s||14 s||Medium||Yes|
|pix2pixHD (Wang et al., 2018)||11 ms||22 ms||Medium||Yes|
|Texture Mixer (Yu et al., 2019)||-||799 ms||Medium||Yes|
|transposer(ours)||43 ms||260 ms||Good||Yes|
Baselines. We compare against several baselines: 1) Naive tiling, which simply tiles the input four times; 2) Self-tuning (Kaspar et al., 2015), the state-of-the-art optimization-based method; 3) pix2pixHD (Wang et al., 2018), the state-of-the-art image-to-image translation network, where we add one more upsampling layer to generate an output 2x2 times larger than the input; 4) WCT (Li et al., 2017b), a style transfer method; 5) DeepTexture (Gatys et al., 2015b), an optimization-based method using network features, for which we directly feed the ground truth as input; 6) Texture Mixer (Yu et al., 2019), a texture interpolation method where we set the interpolation source patches to all come from the input texture; 7) Non-stationary (Non-stat.) (Zhou et al., 2018) and SinGAN (Shaham et al., 2019), both of which overfit one model per texture. We train Non-stat. and SinGAN in two versions each: one with direct access to the exact ground truth at the exact target size, and one without access to the target-size ground truth but only the input. In the paper, we mark the methods that either directly take ground truth images for processing or overfit the model to the ground truth.
Table 1 shows the runtime and corresponding properties of all the methods. Compared with Self-tuning, our method is much faster. In contrast to Non-stat. and SinGAN, transposer (ours) generalizes better and hence does not require per-image training. Compared with DeepTexture and the style transfer method WCT, our method is still much faster, without the need for iterative optimization or SVD decomposition. Even though pix2pixHD is faster than our method, it cannot perform proper texture synthesis, as shown in Figures 5 and 6; the same holds for Texture Mixer (Yu et al., 2019).
Evaluation Metrics. To the best of our knowledge, there is no standard metric to quantitatively evaluate texture synthesis results. We use 3 groups of metrics (6 in total):
Existing metrics include SSIM (Wang et al., 2004), Learning Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) and Fréchet Inception Distance (FID) (Heusel et al., 2017). SSIM and LPIPS are evaluated using image pairs. FID measures the distribution distance between the generated image set and the ground truth image set in feature space.
Crop-based metrics designed for texture synthesis evaluation include crop-based LPIPS (c-LPIPS) and crop-based FID (c-FID). While the original LPIPS and FID are computed on full-size images, c-LPIPS and c-FID operate on crops of images. For c-FID, we crop a set of images from the output image and crop the other set from the ground truth image, and then compute the FID between these two sets (we use a dimension of 64 for c-FID instead of the default 2048 due to a much smaller image set). For c-LPIPS, we compute the LPIPS between the input image and one of the 8 random crops from the output image, and average the scores among the 8 crops.
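The crop-based protocol above can be sketched generically. In the snippet below, the patch distance is a pluggable function (the paper uses LPIPS; an L2 stand-in is used only for illustration), and the crop-sampling details are assumptions.

```python
import numpy as np

def crop_metric(input_img, output_img, patch_dist, crop_hw, n_crops=8, seed=0):
    """Crop-based score in the spirit of c-LPIPS: average a patch distance
    between the input texture and n_crops random crops of the synthesized
    output. `patch_dist` must accept two equal-size patches."""
    rng = np.random.default_rng(seed)
    H, W = output_img.shape[:2]
    ch, cw = crop_hw
    scores = []
    for _ in range(n_crops):
        y = int(rng.integers(0, H - ch + 1))
        x = int(rng.integers(0, W - cw + 1))
        scores.append(patch_dist(input_img, output_img[y:y + ch, x:x + cw]))
    return float(np.mean(scores))
```

c-FID follows the same crop-sampling idea but compares feature distributions between the set of output crops and the set of ground-truth crops instead of averaging pairwise patch distances.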
User Study. Another way to measure the performance of different methods is a user study. We use Amazon Mechanical Turk (AMT) to evaluate the quality of synthesized textures. We perform A/B tests where we provide the user with the input texture image and two synthesized images from different methods, and ask the user to choose the one with better quality. Each image pair is viewed by multiple workers, and the orders are randomized. The obtained preference scores (Pref.) are shown in Table 3; they indicate the portion of workers who prefer our result over the other method.
4.3. Comparison Results
4.3.1. Evaluating synthesis of size 128 to 256
We compare with Self-tuning, pix2pixHD, and WCT on the whole test set of 5,000 images and show the quantitative comparisons in Table 2. Notably, our method outperforms Self-tuning and pix2pixHD on all the metrics.
Because Non-stat., SinGAN, and DeepTexture are too slow to evaluate on all 5,000 test images, we randomly sampled 200 of them for these evaluations. The visual comparison is shown in Figure 5, and the numerical evaluation results are summarized in Table 3. As shown in the 2nd-8th rows of Table 3, our method significantly outperforms all the methods that do not directly take ground truth as input. When compared with Self-tuning, we achieve a better LPIPS score (0.273 vs. 0.358), and 63% of workers prefer the results generated by our method over those generated by Self-tuning. The remaining rows of Table 3 also show that our method performs better than the other size-increasing baselines (Non-stat. and SinGAN) and performs better than or similar to DeepTexture, all of which take ground truth as input. For instance, 51% of workers prefer our results over the ground truth, and 46% prefer our results over DeepTexture, which directly takes the ground truth for its optimization.
4.3.2. Evaluating synthesis of size 256 to 512
We also evaluate 256x256 to 512x512 image synthesis using the same metrics. We show the quantitative results in the supplementary file. Visual comparisons can be found in Figure 6. They confirm that our approach produces superior results. For example, Self-tuning almost completely misses the holes in the 1st input texture image, and pix2pixHD simply enlarges the local contents instead of performing synthesis. In Figure 7, we show 4x4 times larger texture synthesis results using our framework. This is done by running the transposer network twice, with each run performing 2x2 times larger synthesis.
|Self-sim. Map (default)||0.437||74.35||0.366||0.273||0.272|
Figure 8 (columns: Input, Ours, Learn. TransConv, Fixed Map, Random Map).
Figure 10 (columns: input, result 1, result 2, result 3).
4.4. Ablation Study and Random Noise as Input
4.4.1. Ablation study
To understand the role of the self-similarity map, we conduct three additional ablation experiments: 1) Learnable TransConv: using a traditional transposed convolution layer with learnable parameters instead of directly using the encoded features as filters and their self-similarity map as input, while keeping the other network parts and training strategies unchanged; 2) Fixed Map: using fixed sampled maps instead of self-similarity maps; 3) Random Map: using randomly sampled maps instead of self-similarity maps. As shown in Figure 2, we have features at 3 different scales; for Fixed Map and Random Map, we sample the map for the smallest scale's features and then bilinearly upsample it for the other two scales. Table 4 and Figure 8 show the quantitative and qualitative results, respectively. These 3 settings are compared with the default transposer setting, which uses the self-similarity map as the transposed convolution input. It can be seen that Learnable TransConv, with the traditional learnable transposed convolution layer, simply enlarges the input rather than performing reasonable synthesis, similar to pix2pixHD. This confirms our hypothesis that conventional CNN designs with traditional (de)convolution layers and up/down-sampling layers cannot capture the long-range structural dependency required by texture synthesis. Fixed Map cannot produce faithful results either. On the other hand, using a random noise map as the transposed convolution input has both advantages and disadvantages, as discussed below.
4.4.2. Trade-off between self-similarity map and random noise map
In the last column of Figure 8, the 1st row shows that sampling a random noise map at test time can successfully generate diverse results. However, note that the self-similarity map is critical in identifying structural patterns and preserving them in the output. In the 2nd row of Figure 8, the result using self-similarity maps successfully preserves the regular structures, while the one using random noise maps fails. We believe that in practice there is a trade-off between preserving structure and generating variety. For input texture images with regular structural patterns, the self-similarity map provides better guidance for the transposed convolution operation to preserve these structural patterns. On the other hand, using random noise maps as inputs can generate diverse outputs by sampling different noise maps, as shown in Figure 10, and it is also possible to directly generate arbitrarily large texture outputs by sampling larger noise maps, as shown in Figure 9; in contrast, using the self-similarity map only allows synthesis smaller than 3x3 times larger, limited by the size of the self-similarity map.
5. Conclusion & Discussion
In this paper, we present a new deep learning based texture synthesis framework built on transposed convolution operations. In our framework, the transposed convolution filters are the encoded features of the input texture image, and the input to the transposed convolution is the self-similarity map computed on those encoded features. Quantitative comparisons based on existing metrics, our metrics designed specifically for texture synthesis, and user study results all show that our method significantly outperforms existing methods while also being much faster. The self-similarity map helps preserve structure better, while the random noise map allows generating diverse results. Further research could provide more controllable flexibility by combining both the self-similarity map and a random noise map as inputs. One limitation of our method is that it fails to handle sparse thin structures, as shown in Figure 11, as well as highly non-stationary inputs (Zhou et al., 2018). As some highly non-stationary textures mainly emphasize effects in a specific direction, one possible solution may be to emphasize the similarity score in specific directions while suppressing it in others to capture directional effects, and/or to use cropped, resized, or rotated feature maps as transposed convolution filters to capture textons repeating in various forms. We leave these for future research. While existing deep learning based image synthesis methods mostly take inputs from other modalities such as semantic maps or edge maps, we believe our method will also stimulate more deep learning research on exemplar-based synthesis.
Acknowledgements. We would like to thank Brandon Rowllet, Sifei Liu, Aysegul Dundar, Kevin Shih, Rafael Valle and Robert Pottorff for valuable discussions and proofreading.
References

- New Brodatz-based image databases for grayscale, color and multiband texture analysis.
- User-controllable multi-texture synthesis with generative adversarial networks. arXiv preprint arXiv:1904.04751.
- Learning texture manifolds with the periodic spatial GAN. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 469–477.
- Material-specific adaptation of color invariant features. Pattern Recognition Letters 30 (3), pp. 306–313.
- Outex texture database.
- Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- The synthesizability of texture examples. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Image melding: combining inconsistent images using patch-based synthesis. ACM Transactions on Graphics (ToG) 31 (4), pp. 1–10.
- A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.
- Panoptic-based image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8070–8079.
- Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 341–346.
- Texture synthesis by non-parametric sampling. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, pp. 1033–1038.
- The KTH-TIPS database.
- TileGAN: synthesis of large-scale non-homogeneous textures. ACM Transactions on Graphics (Proc. SIGGRAPH) 38 (4), pp. 58:1–58:11.
- A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.
- Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 262–270.
- Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 327–340.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004.
- Texture synthesis with spatial generative adversarial networks. CoRR abs/1611.08207.
- Perceptual losses for real-time style transfer and super-resolution. CoRR abs/1603.08155.
- Self tuning texture optimization. In Computer Graphics Forum, Vol. 34, pp. 349–359.
- Texture optimization for example-based synthesis. In ACM SIGGRAPH 2005 Papers, SIGGRAPH '05, New York, NY, USA, pp. 795–802.
- Graphcut textures: image and video synthesis using graph cuts. In ACM Transactions on Graphics (ToG), Vol. 22, pp. 277–286.
- Photo-realistic single image super-resolution using a generative adversarial network. CoRR abs/1609.04802.
- Appearance-space texture synthesis. ACM Transactions on Graphics (TOG) 25 (3), pp. 541–548.
- Precomputed real-time texture synthesis with Markovian generative adversarial networks. CoRR abs/1604.04382.
- Diversified texture synthesis with feed-forward networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3920–3928.
- Universal style transfer via feature transforms. CoRR abs/1705.08086.
- Real-time texture synthesis by patch-based sampling. ACM Transactions on Graphics (ToG) 20 (3), pp. 127–150.
- Texture synthesis through convolutional neural networks and spectrum constraints. CoRR abs/1605.01141.
- Partial convolution based padding. arXiv preprint arXiv:1811.11718.
- The KTH-TIPS2 database.
- VisTex vision texture database.
- A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision 40 (1), pp. 49–70.
- Shift-map image editing. In 2009 IEEE 12th International Conference on Computer Vision, pp. 151–158.
- Layered shape synthesis: automatic generation of control maps for non-stationary textures. In ACM SIGGRAPH Asia 2009 Papers, SIGGRAPH Asia '09, New York, NY, USA, pp. 107:1–107:9.
- Deep correlations for texture synthesis. ACM Transactions on Graphics (TOG) 36 (5), pp. 161.
- SinGAN: learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4570–4580.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Texture networks: feed-forward synthesis of textures and stylized images. In International Conference on Machine Learning, pp. 1349–1357.
- High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807.
- Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- State of the art in example-based texture synthesis. In Eurographics '09 State of the Art Reports (STARs).
- Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 479–488.
- Feature matching and deformation for texture synthesis. ACM Transactions on Graphics (TOG) 23 (3), pp. 364–367.
- Optimized synthesis of art patterns and layered textures. IEEE Transactions on Visualization and Computer Graphics 20 (3), pp. 436–446.
- Texture mixer: a network for controllable synthesis and interpolation of texture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12164–12173.
- Synthesis of progressively-variant textures on arbitrary surfaces. ACM Transactions on Graphics (TOG) 22 (3), pp. 295–302.
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
- Non-stationary texture synthesis by adversarial expansion. ACM Transactions on Graphics (Proc. SIGGRAPH) 37 (4), pp. 49:1–49:13.
Appendix A Framework Detail
A.1. Self-Similarity Computing & Transposed Convolution
The reviewers are welcome to check the attached animation video showing how a self-similarity map is computed and how the transposed convolution operation is performed.
A.2. Implementation Details for Computing Self-Similarity Map
Computing the self-similarity map can be efficiently implemented with the help of standard convolution operations. The formula for computing the self-similarity map can be relaxed as the following:

S(i, j) = \frac{\sum_{p \in \Omega_{ij}} F(p) \cdot F(p + d_{ij})}{\lVert F_{\Omega_{ij}} \rVert \, \lVert F_{\Omega'_{ij}} \rVert}

Here, \Omega_{ij} and \Omega'_{ij} indicate the overlapping region between the current (i, j)-shifted copy and the original copy, d_{ij} is the corresponding shift offset, and \lVert F_{\Omega} \rVert is the L2 norm of the feature map F restricted to \Omega. The denominator is used for normalization such that the scale of S is independent of the scale of F.
Implementation Details. \lVert F_{\Omega_{ij}} \rVert^2 can be computed by using F \odot F as the convolution input and a convolution filter with weights being all 1s and biases being all 0s. The numerator \sum_{p \in \Omega_{ij}} F(p) \cdot F(p + d_{ij}) can be computed by using the zero-padded F, with zero padding of H−1 on the top and bottom sides and W−1 on the left and right sides for an H×W feature map, as the convolution input and F as the convolution filter. Similarly, \lVert F_{\Omega'_{ij}} \rVert^2 can be computed by using a map with the center region being 1 and the other regions being 0 as the convolution input and F \odot F as the convolution filter.
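Under these definitions, the computation can be sketched in PyTorch with standard convolutions. This is a minimal illustration, not the paper's exact implementation; the tensor layout and the flip-based symmetry trick for the second overlap norm are our assumptions.

```python
import torch
import torch.nn.functional as F

def self_similarity_map(feat):
    """Compute a normalized self-similarity map of `feat` (shape (C, H, W))
    using only standard convolutions. For each shift (i, j), the feature map
    is correlated with its shifted copy over their overlap, then normalized
    by the L2 norms of the two overlapping regions."""
    c, h, w = feat.shape
    x = feat.unsqueeze(0)                           # (1, C, H, W)
    pad = (w - 1, w - 1, h - 1, h - 1)
    # Numerator: zero-pad by (H-1, W-1) and correlate the map with itself.
    num = F.conv2d(F.pad(x, pad), x)                # (1, 1, 2H-1, 2W-1)
    # Squared norm of each overlapping region: convolve the channel-summed
    # squared features with an all-ones filter of the exemplar's spatial size.
    sq = F.pad((x * x).sum(1, keepdim=True), pad)   # (1, 1, 3H-2, 3W-2)
    norm_sq = F.conv2d(sq, torch.ones(1, 1, h, w))  # (1, 1, 2H-1, 2W-1)
    # By symmetry, the shifted copy's overlap norm is the same map with
    # flipped offsets.
    denom = (norm_sq * norm_sq.flip((-2, -1))).clamp_min(1e-12).sqrt()
    return (num / denom)[0, 0]                      # (2H-1, 2W-1)

s = self_similarity_map(torch.randn(8, 16, 16))
print(s.shape)  # torch.Size([31, 31]); the zero-shift center value is 1
```

By Cauchy–Schwarz, every entry lies in [−1, 1], and the center entry (zero shift) is exactly 1, which makes the normalization easy to sanity-check.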
A.3. Transposed Convolution Block
Table 5 lists the main differences between the typical transposed convolution operation and our transposed convolution operation.
Fig. 12 shows the details of the transposed convolution block in our framework.
| | typical transposed conv operation | our transposed conv operation |
|---|---|---|
| input | output from previous layer | self-similarity map from encoded features |
| filter | learnable parameters | feature maps from encoder |
| bias term | learnable parameters | avg-pooling of encoded features with linear transform |
| filter size | small (e.g. 4×4, 3×3) | large (e.g. 8×8, 16×16, 32×32, 64×64) |
| stride | 2 (for upsampling purpose) | 1 |
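The table above can be made concrete with a minimal PyTorch sketch of the stride-1 case; the channel counts and spatial sizes below are illustrative assumptions, not the exact network configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: a 32x32 exemplar feature map with 64 channels and its
# (2*32-1) x (2*32-1) self-similarity map. With stride 1, the output spatial
# size is input + filter - 1, i.e. close to 3x the exemplar's size.
feat = torch.randn(1, 64, 32, 32)  # encoded features, used as the filter
sim = torch.rand(1, 1, 63, 63)     # self-similarity map, used as the input
# conv_transpose2d expects a weight of shape (in_channels, out_channels, kH, kW);
# here in_channels = 1 matches the single-channel self-similarity map.
out = F.conv_transpose2d(sim, feat, stride=1)
print(out.shape)  # torch.Size([1, 64, 94, 94])
```

Note the contrast with a typical transposed convolution: the filter is a data-dependent feature map rather than a learned parameter, and the large filter size is what lets each input position paste a full copy of the exemplar's features into the output.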
A.4. Network Details
Table 7 shows the details of the generator. The discriminator network is the same as in pix2pixHD (Wang et al., 2018). We use partial convolution based padding (Liu et al., 2018) instead of zero padding for all convolution layers.
| Block | Filter Size | # Filters | Stride/Up Factor | Sync BN | ReLU |
|---|---|---|---|---|---|
| SelfSimilarityMapBranch_Conv2 (w/ Conv3_2) | 3×3 | 8 → 1 | 1 | - | - |
| Transposed Convolution Operation | filter: 256, input: 1 → 256 | - | - | - | - |
| SelfSimilarityMapBranch_Conv1 (w/ Conv4_2) | 3×3 | 1 → 8 | 1 | - | Y |
| Transposed Convolution Operation | filter: 512, input: 1 → 512 | - | - | - | - |
| SelfSimilarityMapBranch_Conv1 (w/ Conv5_2) | 3×3 | 1 → 8 | 1 | - | Y |
| Transposed Convolution Operation | filter: 1024, input: 1 → 1024 | - | - | - | - |
| Sum (w/ TransConv_Block5 output) | - | - | - | - | - |
| Sum (Conv6 + TransConv_Block4 output) | - | - | - | - | - |
| Sum (Conv7 + TransConv_Block3 output) | - | - | - | - | - |
Appendix B Additional Comparison
B.1. 256 to 512 Synthesis
In Table 6, we provide the quantitative comparisons for the 256-to-512 synthesis results.
B.2. 128 to 256 Synthesis
Non-stat. and Non-stat.* baselines: we take the original code from the authors' GitHub repository. The original training strategy for each training iteration is: 1) randomly crop a target image from the original big image; 2) from the target image, randomly crop a half-size image as the input image. Thus, for 128-to-256 synthesis, to train Non-stat. (without seeing the ground truth image), for each training iteration we randomly crop a target image from the 128×128 input texture and then, from that target image, randomly crop a half-size image as the input. To train Non-stat.* (with direct access to the ground truth image), for each training iteration we randomly crop a 256×256 image from the ground truth image as the target image and then, from the target image, randomly crop a 128×128 image as the input image. For both Non-stat. and Non-stat.*, the inference stage takes the 128×128 image as input.
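The two-level cropping scheme above can be sketched as follows; this is a minimal sketch, and the function name, tensor layout, and 512×512 source size are our assumptions for illustration.

```python
import random
import torch

def crop_pair(img, target_size, input_size):
    """Sample one (input, target) training pair: crop a target patch from
    `img`, then crop a smaller input patch from inside that target."""
    _, h, w = img.shape
    ty = random.randint(0, h - target_size)
    tx = random.randint(0, w - target_size)
    target = img[:, ty:ty + target_size, tx:tx + target_size]
    iy = random.randint(0, target_size - input_size)
    ix = random.randint(0, target_size - input_size)
    inp = target[:, iy:iy + input_size, ix:ix + input_size]
    return inp, target

# Non-stat.* style iteration in the 128-to-256 setting: a 256x256 target
# cropped from the (assumed 512x512) ground truth, then a 128x128 input
# cropped from inside the target.
inp, tgt = crop_pair(torch.randn(3, 512, 512), 256, 128)
```

Because the input is always cropped from inside the target, every training pair teaches the network a consistent 2× expansion around the input patch.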
sinGAN and sinGAN* baselines: for training with sinGAN, we used the original authors' implementation available on GitHub, with the default settings provided in their source code. The sinGAN code can synthesize textures in two different modes: one generates a random variation of the same size as the input texture (here we directly use the ground truth image for training, denoted as sinGAN*), and the other generates a texture of larger size (using only the input image, denoted as sinGAN).