User-Controllable Multi-Texture Synthesis with Generative Adversarial Networks

04/09/2019 · Aibek Alanov et al.

We propose a novel multi-texture synthesis model based on generative adversarial networks (GANs) with a user-controllable mechanism. The user control mechanism allows one to explicitly specify which texture the model should generate. This property follows from using an encoder that learns a latent representation for each texture in the dataset. To ensure dataset coverage, we use an adversarial loss function that penalizes incorrect reproductions of a given texture. In experiments, we show that our model can learn descriptive texture manifolds for large datasets and from raw data such as a collection of high-resolution photos. Moreover, we apply our method to produce 3D textures and show that it outperforms existing baselines.


1 Introduction

Textures are essential perceptual elements in computer graphics. They can be defined as images with repetitive or periodic local patterns. Texture synthesis models based on deep neural networks have recently drawn great interest from the computer vision community. Gatys et al. [10, 12] proposed to use a convolutional neural network as an effective texture feature extractor: the Gram matrices of hidden-layer activations of a pre-trained VGG network serve as texture descriptors. Follow-up papers [16, 35, 23] significantly sped up texture synthesis by replacing the expensive optimization process of [10, 12] with a fast forward pass of a feed-forward convolutional network. However, these methods suffer from problems such as poor generality (one network has to be trained per texture) and poor diversity (synthesized textures are visually indistinguishable from each other).

Figure 2: Training pipeline of the proposed method.

Table 1: Comparison of multi-texture synthesis methods (PSGAN [2], DTS [24], and ours) with respect to the following properties: multi-texture synthesis, user control, dataset coverage, scalability with respect to dataset size, ability to learn textures from raw data, unsupervised texture detection, and applicability to 3D.

Recently, the Periodic Spatial GAN (PSGAN) [2] and Diversified Texture Synthesis (DTS) [24] models were proposed as an attempt to partly solve these issues. PSGAN and DTS are multi-texture synthesis models, i.e., they train one network for generating many textures. However, each model has its own limitations (see Table 1). PSGAN has incomplete dataset coverage and lacks a user control mechanism. Incomplete dataset coverage means that it can miss some textures from the training dataset. Without user control, one cannot explicitly specify which texture PSGAN should generate. DTS is not scalable with respect to dataset size and cannot be applied to learn textures from raw data or to synthesize 3D textures. It is not scalable because the number of parameters of the DTS model depends linearly on the number of textures in the dataset. Learning from raw data means that the input to the model is a high-resolution image, as in Figure 1, and the method should extract textures in an unsupervised way. DTS does not support such a training mode (which we call fully unsupervised) because its input textures must be specified explicitly. DTS also does not generalize to 3D textures because the VGG network it relies on is not applicable to 3D images.

We propose a novel multi-texture synthesis model which does not have the limitations of PSGAN and DTS. Our model allows generating a user-specified texture from the training dataset. This is achieved by using an encoder network which learns a latent representation for each texture in the dataset. To ensure complete dataset coverage, we use a loss function that penalizes incorrect reproductions of a given texture. Thus, the generator is forced to learn to synthesize every texture seen during the training phase. Our method is more scalable with respect to dataset size than DTS and is able to learn textures in a fully unsupervised way from raw data such as a collection of high-resolution photos. We show that our model learns a descriptive texture manifold in the latent space. Such low-dimensional representations can serve as useful texture descriptors, for example, for unsupervised texture detection (see Figure 1). We can also apply our approach to 3D texture synthesis because we use fully adversarial losses and do not rely on VGG network descriptors.

We experimentally show that our model can learn large texture datasets. We check that our generator learns all textures from the training dataset by conditionally synthesizing each of them. We demonstrate that our model can learn meaningful texture manifolds, unlike PSGAN (see Figure 6). We compare the efficiency of our approach and DTS in terms of memory consumption and show that our model is much more scalable than DTS for large datasets.

We apply our method to 3D texture-like porous media structures, a real-world problem from Digital Rock Physics. Synthesis of porous structures plays an important role [39] because assessing the variability of the inherent material properties is often not experimentally feasible. Moreover, it is usually necessary to acquire a number of representative samples of the void-solid structure. We show that our method outperforms a porous media synthesis baseline [27] that trains one network per texture.

To briefly summarize, we highlight the following key advantages of our model:

  • user control (conditional generation),

  • full dataset coverage,

  • scalability with respect to dataset size,

  • ability to learn descriptive texture manifolds from raw data in a fully unsupervised way,

  • applicability to 3D texture synthesis.

Figure 3: The architecture of the discriminator on image pairs.

2 Proposed Method

We look for a multi-texture synthesis pipeline that can generate textures in a user-controllable manner, ensures full dataset coverage, and is scalable with respect to dataset size. We use an encoder network that maps textures to a latent space and yields low-dimensional representations. We use a generator network similar to the one in PSGAN.

The generator G takes as input a noise tensor Z which consists of three parts: Z^g, Z^l, and Z^p. These parts are the same as in PSGAN:

  • Z^g is a global part which determines the type of texture. It consists of a single vector z^g of dimension d_g which is repeated along the spatial dimensions.

  • Z^l is a local part, and each of its elements is sampled independently from a standard normal distribution. This part encourages diversity within one texture.

  • Z^p is a periodic part whose elements are built from the global code z^g by trainable functions together with noise sampled independently. This part helps to generate periodic patterns.

We see that for generating a texture it is sufficient to feed the vector z^g to the generator, because Z^l is sampled independently of z^g and Z^p is computed from it. This means that we can consider z^g as a latent representation of the corresponding texture, and we will train our encoder to recover this latent vector for an input texture x. Further, we will assume that the generator takes only the vector z^g as input and builds the other parts of the noise tensor from it. For simplicity, we denote z^g as z in what follows.
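To make this construction concrete, the following sketch (our reading of the PSGAN-style noise layout, not the authors' exact code) assembles the three-part noise tensor from a single global code z; the periodic part is produced here by a hypothetical small network `period_net` that predicts wave vectors from z.

```python
import math
import torch

def build_noise_tensor(z_g, period_net, d_l=20, d_p=4, spatial=(8, 8)):
    """Assemble a PSGAN-style noise tensor Z = [Z^g, Z^l, Z^p] from one global code z_g.

    z_g        -- tensor of shape (batch, d_g), the latent texture code
    period_net -- assumed module mapping z_g to 2 * d_p wave-vector parameters
    """
    b, d_g = z_g.shape
    h, w = spatial

    # Global part: repeat the single vector over all spatial positions.
    Z_g = z_g.view(b, d_g, 1, 1).expand(b, d_g, h, w)

    # Local part: i.i.d. standard normal noise, different at every position.
    Z_l = torch.randn(b, d_l, h, w)

    # Periodic part: sinusoids whose wave vectors are predicted from z_g;
    # a random phase shifts the pattern between samples.
    waves = period_net(z_g).view(b, d_p, 2)
    i = torch.arange(h).float().view(1, 1, h, 1)
    j = torch.arange(w).float().view(1, 1, 1, w)
    phase = 2 * math.pi * torch.rand(b, d_p, 1, 1)
    Z_p = torch.sin(waves[:, :, :1, None] * i + waves[:, :, 1:, None] * j + phase)

    return torch.cat([Z_g, Z_l, Z_p], dim=1)   # (b, d_g + d_l + d_p, h, w)
```

At sampling time z_g would come from the prior, while for reproductions it would come from the encoder.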

The encoder E takes an input texture x and returns the distribution E(z|x) of the global vector z for the texture x (the same latent code that is fed to the generator).

Then we can formulate properties of the generator and the encoder we expect in our model:

  • samples G(z) are realistic textures if z is sampled from the prior distribution p(z) (in our case, a standard normal distribution);

  • if z ∼ E(z|x), then G(z) has the same texture type as x;

  • the aggregated distribution q(z) of the encoder should be close to the prior distribution p(z), i.e., q(z) = E_{x∼p*(x)}[E(z|x)] ≈ p(z), where p*(x) is the true distribution of textures;

  • samples G(z) are realistic textures if z is sampled from the aggregated distribution q(z).

To ensure these properties we use three types of adversarial losses:

  • generator matching: for matching the distributions of both samples G(z), z ∼ p(z), and reproductions G(E(x)) to the distribution of real textures p*(x).

  • pair matching: for matching the distribution of pairs (x, G(E(x))) to the distribution of pairs (x, x̃), where x and x̃ are samples of the same texture. It ensures that G(E(x)) has the same texture type as x.

  • encoder matching: for matching the aggregated distribution q(z) to the prior distribution p(z).

We consider exact definitions of these adversarial losses in Section  2.1. We demonstrate the whole pipeline of the training procedure in Figure 2 and in Appendix B.

2.1 Generator & Encoder Objectives

Generator Matching. For matching both samples and reproductions to real textures, we use a discriminator D_1 as in PSGAN, which maps an input image to a two-dimensional tensor of spatial probabilities. Each element of the discriminator's output corresponds to a local receptive field of the input image and estimates the probability that this receptive field is real rather than synthesized by the generator G. The value function of this adversarial game is the following:

(1)   V_1(D_1, G, E) = E_{x∼p*(x)}[log D_1(x)] + E_{z∼p(z)}[log(1 − D_1(G(z)))] + E_{x∼p*(x), z∼E(z|x)}[log(1 − D_1(G(z)))],

where D_1(·) denotes the discriminator output averaged over its spatial positions.

As in [13], we modify the value function for the generator by substituting the term log(1 − D_1(G(z))) with −log D_1(G(z)). So, the adversarial loss for the generator is

(2)   L_1(G, E) = −E_{z∼p(z)}[log D_1(G(z))] − E_{x∼p*(x), z∼E(z|x)}[log D_1(G(z))]
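A minimal sketch of how losses of this form can be computed in PyTorch (a paraphrase of the description above, not the official implementation); D1 is assumed to output raw spatial logits of shape (batch, 1, s, s), so binary cross-entropy with logits averaged over all positions realizes the spatial averaging.

```python
import torch
import torch.nn.functional as F

def d1_loss(D1, real, fake):
    """Discriminator loss: classify every receptive field of real patches as real
    and of generated patches as fake, averaged over spatial positions."""
    real_logits = D1(real)                      # (b, 1, s, s)
    fake_logits = D1(fake.detach())
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def g_matching_loss(D1, fake):
    """Non-saturating generator loss: -log D1(G(z)), averaged over spatial positions."""
    fake_logits = D1(fake)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```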

Pair Matching. The goal is to match fake pairs (x, G(E(x))) to real ones (x, x̃), where x and x̃ are samples of the same texture (in practice, we obtain real pairs by taking two different random patches from one texture). For this purpose we use a discriminator D_2 with the special architecture shown in Figure 3. The discriminator takes two input images and convolves them separately with the same convolutional layers. After obtaining an embedded tensor for each input image, we reshape each tensor into a matrix whose rows are the embeddings of the corresponding receptive fields of the initial image. Then we calculate the pairwise element-wise products of the rows of these two matrices, which gives us a tensor of pairwise interactions. We convolve it with two convolutional layers with 1x1 kernels and obtain a two-dimensional matrix. The element in the i-th row and j-th column of this matrix represents the mutual similarity between the i-th receptive field of the first image and the j-th receptive field of the second image. Then we average this matrix row-wise (for the first image) and column-wise (for the second image). We obtain two vectors and reshape them into spatial matrices. To simplify the notation, we concatenate these matrices into one matrix, take an element-wise sigmoid, and output the result as the matrix of the discriminator's probabilities, as in PSGAN.
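The following module is a sketch of the pair discriminator just described, under our own assumptions about channel widths and depth (the real network in Table 7 is deeper); it illustrates the shared trunk, the pairwise interaction tensor, the 1x1 scoring convolutions, and the row/column averaging.

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Sketch of the pair discriminator described above (channel sizes are assumptions).

    Both images are embedded by shared convolutions; all pairwise interactions between
    their receptive fields are scored and then averaged back to per-position predictions."""

    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.features = nn.Sequential(                    # shared convolutional trunk
            nn.Conv2d(in_ch, feat_ch, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(feat_ch, feat_ch, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
        )
        self.score = nn.Sequential(                       # 1x1 convolutions over the interaction tensor
            nn.Conv2d(feat_ch, feat_ch, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat_ch, 1, 1),
        )

    def forward(self, x1, x2):
        f1, f2 = self.features(x1), self.features(x2)     # (b, c, h, w) each
        b, c, h, w = f1.shape
        m1 = f1.flatten(2).transpose(1, 2)                # (b, h*w, c): one embedding per receptive field
        m2 = f2.flatten(2).transpose(1, 2)
        # Pairwise element-wise products between every receptive field of x1 and of x2.
        pairwise = m1[:, :, None, :] * m2[:, None, :, :]  # (b, h*w, h*w, c)
        sim = self.score(pairwise.permute(0, 3, 1, 2))    # (b, 1, h*w, h*w)
        s1 = sim.mean(dim=3).view(b, 1, h, w)             # row-wise average -> map for x1
        s2 = sim.mean(dim=2).view(b, 1, h, w)             # column-wise average -> map for x2
        return torch.sigmoid(torch.cat([s1, s2], dim=1))  # per-position probabilities
```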

We consider the following distributions:

  • p_r over real pairs (x, x̃), where x and x̃ are examples of the same texture;

  • p_f over fake pairs (x, x̂), where x is a real texture and x̂ is its reproduction, i.e., x̂ = G(E(x)).

We denote the size of the discriminator's output matrix as k_1 × k_2 and D_2(x, x̃)_{ij} as the (i, j)-th element of this matrix. The value function for this adversarial game is

(3)   V_2(D_2, G, E) = E_{(x, x̃)∼p_r}[ (1 / (k_1 k_2)) Σ_{i,j} log D_2(x, x̃)_{ij} ] + E_{(x, x̂)∼p_f}[ (1 / (k_1 k_2)) Σ_{i,j} log(1 − D_2(x, x̂)_{ij}) ]

The discriminator tries to maximize the value function while the generator and the encoder minimize it.

Then the adversarial loss is

(4)   L_2(G, E) = −E_{(x, x̂)∼p_f}[ (1 / (k_1 k_2)) Σ_{i,j} log D_2(x, x̂)_{ij} ]

To compute gradients with respect to the encoder parameters, we use the reparameterization trick [18, 30, 34].
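A short sketch of this step, under the assumption that the encoder outputs a mean and a log-variance: the reparameterized sample keeps the loss differentiable with respect to the encoder, and the non-saturating pair loss pushes (x, G(E(x))) toward real same-texture pairs.

```python
import torch
import torch.nn.functional as F

def sample_latent(mu, logvar):
    """Reparameterized sample z = mu + sigma * eps, differentiable w.r.t. the encoder outputs."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def pair_matching_g_e_loss(D2, x, x_rec):
    """Non-saturating loss pushing the pair (x, G(E(x))) to look like a real same-texture pair.
    D2 is assumed here to return raw spatial logits (sigmoid applied outside)."""
    logits = D2(x, x_rec)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```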

Encoder Matching. We need encoder matching because, if we train the encoder with only the objective above, the texture embeddings can end up very far from samples of the prior distribution p(z). This would lead to unstable training of the generator, since it has to generate good images both for samples from the prior and for embeddings produced by the encoder E.

Therefore, to regularize the encoder, we match the prior distribution p(z) and the aggregated encoder distribution q(z) using a discriminator D_3. It classifies samples drawn from p(z) versus ones drawn from q(z). The minimax game of D_3 and E is defined as min_E max_{D_3} V_3(D_3, E), where V_3 is

(5)   V_3(D_3, E) = E_{z∼p(z)}[log D_3(z)] + E_{z∼q(z)}[log(1 − D_3(z))]

To sample from q(z), we first sample a texture x ∼ p*(x) and then sample z from the encoder distribution E(z|x). The adversarial loss is

(6)   L_3(E) = −E_{z∼q(z)}[log D_3(z)]

As for the loss L_2, we compute the gradients of L_3 with respect to the encoder parameters using the reparameterization trick.
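A sketch of the encoder-matching objectives in this adversarial-autoencoder style (loss helper names are ours); D3 is assumed to be a small MLP on latent vectors that outputs a single logit.

```python
import torch
import torch.nn.functional as F

def d3_loss(D3, z_prior, z_agg):
    """Train D3 to tell prior samples apart from (detached) encoder samples."""
    p = D3(z_prior)
    q = D3(z_agg.detach())
    return (F.binary_cross_entropy_with_logits(p, torch.ones_like(p)) +
            F.binary_cross_entropy_with_logits(q, torch.zeros_like(q)))

def encoder_matching_loss(D3, z_agg):
    """Non-saturating loss pushing the aggregated encoder distribution toward the prior.
    z_agg must come from the reparameterized sampler so gradients reach the encoder."""
    q = D3(z_agg)
    return F.binary_cross_entropy_with_logits(q, torch.ones_like(q))
```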

Final Objectives. Thus, for both the generator and the encoder we optimize the following objectives:

  • the generator loss

    (7)   L_G = L_1(G, E) + L_2(G, E)
  • the encoder loss

    (8)   L_E = L_2(G, E) + L_3(E)

In experiments, we use .

3 Related Work

Traditional texture synthesis models can broadly be divided into two categories: non-parametric and parametric. Non-parametric methods [7, 8, 19, 40] synthesize a new texture by repeating and resampling local patches from a given example. Such techniques allow obtaining large textures; however, they require heavy computations and can be slow. Parametric approaches [14, 28] build an explicit model of textures by introducing statistical measures. To generate a new texture, one runs an optimization process that matches the statistics of the synthesized image to those of a given texture. The method of [28] shows good results in generating different textures. The main limitations of this approach are its high time complexity and the need to define handcrafted statistics for matching textures.

Deep learning methods were shown to be an efficient parametric model for texture synthesis. The papers of Gatys et al. [10, 12] are a milestone: they proposed to use Gram matrices of VGG intermediate layer activations as texture descriptors. This approach allows generating high-quality texture images [10] by running an expensive optimization process. Subsequent works [35, 16, 23] significantly accelerate texture synthesis by approximating this optimization procedure with fast feed-forward convolutional networks. Further works improve this approach either by using optimization techniques [9, 11, 22], introducing instance normalization [37, 36], or applying GAN-based models for non-stationary texture synthesis [42]. These methods have significant limitations, such as the requirement to train one network per texture and poor diversity of samples.

Multi-texture synthesis methods. DTS [24] was introduced by Li et al. as a multi-texture synthesis model. It consists of one feed-forward convolutional network which takes a one-hot vector corresponding to a specific texture and a noise vector, passes them through convolutional layers, and generates an image. Such an architecture makes DTS non-scalable for large datasets because the number of model parameters depends linearly on the dataset size. It cannot learn from raw data in a fully unsupervised way because the input textures for this model have to be specified explicitly by one-hot vectors. Also, this method is not applicable to 3D textures because it relies on VGG Gram matrix descriptors, which are suitable only for 2D images.

The Spatial GAN (SGAN) model [15], introduced by Jetchev et al., was the first method to apply GANs [13] to texture synthesis. It showed good results on certain textures, surpassing the results of [10]. Bergmann et al. [2] improved SGAN by introducing the Periodic Spatial GAN (PSGAN) model. It allows learning multiple textures because its input noise has a hierarchical structure. Since PSGAN optimizes only the vanilla GAN loss, it does not ensure full dataset coverage. This issue, also known as mode collapse, is a common problem in GAN models [1, 29, 33]. Also, this method does not allow conditional generation of textures, i.e., one cannot explicitly specify which texture the model should generate.

Our model is based on GANs with an encoder network that maps an input texture to a latent embedding. There are many different ways to train autoencoding GANs [31, 38, 3, 6, 5, 21, 43]. The main component of such models is the objective responsible for accurate reproduction of a given image by the model. Standard choices are pixel-wise norms [31, 38, 43] or perceptual distances [3]. For textures, the VGG Gram matrix-based loss is more common [10, 35, 16]. We use an adversarial loss for this purpose, inspired by [41], where it is used for image synthesis guided by sketch, color, and texture. The benefit of such a loss is that it can easily be applied to 3D textures. Previous works [27, 39] on synthesizing 3D porous materials used GAN-based methods with 3D convolutional layers inside the generator and the discriminator. However, they trained a separate model for each texture. We show that our model can learn multiple 3D textures with conditional generation ability.

4 Experiments

(a) PSGAN-5D samples
(b) Our-2D model samples
(c) Our-2D model reproductions. Columns 1,4,7,10 are real textures, others are reproductions
Figure 4: Examples of generated/reproduced textures from PSGAN and our model

In experiments, we train our model on the scaly, braided, honeycomb, and striped categories from the Oxford Describable Textures Dataset [4]. These are datasets of natural textures in the wild. We use the same fully convolutional architectures for the generator and the discriminator D_1 as in PSGAN [2]. We used spectral normalization [26] for the discriminators, which significantly improved training stability. For the encoder we used an architecture similar to that of D_1. The global dimension d_g was found to be a sensitive parameter, and we choose it separately for different models. The encoder network outputs a tensor that is passed through global average pooling to get the mean and variance parameters of the encoding distribution E(z|x). As in the PSGAN model, we fix the local and periodic dimensions. For the pair discriminator D_2 we used the architecture described in Figure 3. A complete reference for the network architectures can be found in Appendix C.

4.1 Inception Score for Textures

It is common practice in natural image generation to evaluate a model that approximates the data distribution using the Inception Score [32]. For this purpose, an Inception network is used to obtain the label distribution p(y|x). Then one calculates

(9)   IS = exp( E_{x∼p_g} [ KL( p(y|x) ‖ p(y) ) ] ),

where p(y) is the aggregated (marginal) label distribution. The straightforward application of the Inception network does not make sense for textures. Therefore, we train a classifier with a similar architecture (the only difference is the number of output logits) to predict texture types for a given texture dataset. To do that properly, we manually clean our data from duplicates so that every texture example has a distinct label, and we use random cropping as data augmentation. Our trained classifier achieves 100% accuracy on the scaly dataset. We use this classifier to evaluate the Inception Score for models trained on the same texture dataset.
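For concreteness, a small sketch of how the score in Equation (9) can be computed from classifier outputs; `probs` is assumed to be an (N, num_classes) array of softmax probabilities of the texture classifier on generated samples.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Compute exp(E_x[KL(p(y|x) || p(y))]) from per-sample class probabilities."""
    p_y = probs.mean(axis=0, keepdims=True)                       # aggregated label distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Example: a perfectly confident classifier over 4 evenly covered classes gives IS close to 4.
probs = np.eye(4)[np.random.randint(0, 4, size=1000)]
print(inception_score(probs))
```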

Model | Uncond. IS | Cond. IS
PSGAN-5D | 73.68 ± 0.6 | NA
Our-2D | 73.74 ± 0.3 | 103.96 ± 0.1
Table 2: Inception Scores for conditional and unconditional generation from PSGAN and our model. The classifier used to compute IS achieved perfect accuracy on the training data.

4.2 Unconditional and Conditional Generation

(a) PSGAN-5D samples
(b) Our-2D model samples
(c) Our-2D model reproductions
Figure 5: Histogram of classifier predictions on 50000 generated samples from PSGAN (a) and Our model (b) and for 500 reproductions per class for our model (c). Each bin represents a separate texture class.

For models like PSGAN we cannot obtain reproductions; we only have access to the texture generation process. One would like a guarantee that a model is able to generate every texture in the dataset from the prior distribution alone. We evaluate PSGAN and our model on the scaly dataset with 116 unique textures. After the models are trained, we estimate the Inception Score. We observed that the Inception Score varies with the global dimension d_g and thus picked the best value separately for PSGAN and our model, obtaining d_g = 5 and d_g = 2, respectively. Both models were trained with Adam [17] (betas = 0.5, 0.99) with batch size 64 on a single GPU. Their best performance was achieved in around 80k iterations. For both models, we used spectral normalization to improve training stability [26].

Both models can generate high-quality textures from a low-dimensional space. Our model can additionally generate reproductions of every texture in the dataset. Figure 5 and Table 2 summarise the results for conditional (reproduction) and unconditional texture generation. Figure 5 indicates that PSGAN may miss some textures, while our model does not suffer from this issue. The Inception Score suggests that conditional generation is a far better way to sample from the model. In Figure 4 we provide samples and texture reproductions for the trained models. A larger set of samples and reproductions for every texture can be found in Appendix A.1, along with evaluations on the braided, honeycomb, and striped categories of the Oxford Describable Textures Dataset.

4.3 Texture Manifold

(a) PSGAN 2D manifold
(b) Our model 2D manifold
Figure 6: 2D manifolds for 116 textures from the scaly dataset. Our model places each texture at a distinct location. The grid is taken over the latent space with a fixed step.

The autoencoding property is a nice-to-have feature for generative models: one can treat the embeddings as low-dimensional data representations. As shown in Section 4.2, our model can reconstruct every texture in the dataset. Moreover, we are able to visualize the manifold of textures since we trained this model with d_g = 2. To compare this manifold to PSGAN, we train a separate PSGAN model with d_g = 2. The 2D manifolds near the prior distribution for both models are shown in Figure 6. Our model learns a visually better 2D manifold and places similar textures nearby. Visualizations of the manifolds during training (for different epochs) can be found in Appendix A.1.

4.4 Learning Texture Manifolds from Raw Data

The manifold learned in Section 4.3 was obtained from well-prepared data. Real cases usually do not have clean data and require either expensive data preparation or unsupervised methods. With minor corrections to the data preparation pipeline, our model can learn texture manifolds from raw data such as a collection of high-resolution photos. To train texture manifolds on raw data, we suggest constructing the real pairs (x, x̃) in Equation 3 from two crops taken at almost the same location, using the stochastic procedure described in Algorithm 1. In Figure 7 we provide a manifold learned from the Merrigum House photo.

Given: a random image X, crop size, window size (default)
Return: a pair of crops (x, x̃) taken from nearby locations of X
Algorithm 1: Obtaining real pairs (x, x̃) for training on raw data
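Since the exact steps of Algorithm 1 did not survive extraction, the sketch below shows one plausible reading of it: two crops taken at random positions that differ by at most a small window (the function and default parameters are ours; the crop size matches Table 9, and the image is assumed large enough for the jitter).

```python
import random

def sample_same_texture_pair(image, crop=160, window=20):
    """Return two nearby crops of `image` (a PIL.Image or similar with .crop and .size),
    used as a real pair (x, x~) of the 'same texture' for the pair discriminator."""
    w, h = image.size
    # Anchor position of the first crop, leaving room for the window jitter.
    x0 = random.randint(window, w - crop - window)
    y0 = random.randint(window, h - crop - window)
    # Second crop is jittered by at most `window` pixels in each direction.
    dx = random.randint(-window, window)
    dy = random.randint(-window, window)
    x_a = image.crop((x0, y0, x0 + crop, y0 + crop))
    x_b = image.crop((x0 + dx, y0 + dy, x0 + dx + crop, y0 + dy + crop))
    return x_a, x_b
```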

4.5 Spatial Embeddings and Texture Detection

As described in Sections 4.3 and 4.4, our method can learn a descriptive texture manifold from a collection of raw data in an unsupervised way. The obtained texture embeddings may be useful on their own. Consider a large input image X, such as the first image in Figure 1, and the generator and encoder trained on this image. Note that at the training stage the encoder is a fully convolutional network followed by global average pooling. Applied to X as-is, the encoder would output an "average" texture embedding for the whole image. Replacing global average pooling with spatial average pooling with a small kernel allows the encoder to output a texture embedding for each receptive field of the input image X. We refer to this modified encoder as the spatial encoder.

Its output on X is a tensor of spatial texture embeddings. They change smoothly along the spatial dimensions, as visualized by reconstructing them with the generator (described in Appendix C) in the third picture of Figure 1.

One can take a reference patch with a texture of interest (e.g., grass) and find similar textures in the image X. This is illustrated in the last picture of Figure 1. We picked a patch with grass on it and constructed a heatmap

(10)   H_{ij} = ρ( e_ref, e_{ij} ),

where ρ is the Euclidean distance, e_ref is the embedding of the reference patch, and e_{ij} is the spatial embedding at position (i, j). We then interpolated the heatmap H to the original size of X.

This example shows that the learned embeddings can be used for other tasks that are not related to texture generation. We believe supervised methods would benefit from adding such features obtained in an unsupervised way.
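As an illustration, the sketch below compares spatial embeddings to a reference embedding with a Euclidean distance, in the spirit of Equation (10) (the tensor shapes and the bilinear upsampling choice are our assumptions).

```python
import torch
import torch.nn.functional as F

def texture_heatmap(spatial_embeddings, reference_embedding, image_size):
    """Distance of every spatial embedding to the reference texture embedding,
    upsampled to the original image size.

    spatial_embeddings  -- tensor (1, d, h, w) from the spatially pooled encoder
    reference_embedding -- tensor (d,) from the encoder applied to a reference patch
    image_size          -- (H, W) of the original image
    """
    ref = reference_embedding.view(1, -1, 1, 1)
    dist = torch.norm(spatial_embeddings - ref, dim=1, keepdim=True)   # (1, 1, h, w)
    heat = F.interpolate(dist, size=image_size, mode="bilinear", align_corners=False)
    return heat.squeeze()      # low values = regions similar to the reference texture
```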

4.6 Memory Complexity

In this section, we compare the scalability of DTS and our model with respect to dataset size. We denote the number of parameters as N and the dataset size as T. The number of parameters of the DTS model (we use the official implementation, file "Model_multi_texture_synthesis.lua") is

(11)   N_DTS(T) = c_0 + c_1 · T,

i.e., it grows linearly with the dataset size T (c_0 and c_1 are architecture-dependent constants).

We should note that the DTS parameter count depends on T, the size of the whole dataset, while the number of unique textures in the dataset can be much smaller than T. Therefore, the method is not scalable to large datasets with duplications. To reduce memory complexity, DTS requires labeling: this would allow the method to identify unique textures and set the size of the one-hot vector to the number of different texture types. Our model learns textures in an unsupervised way and, instead of a one-hot vector, uses a low-dimensional representation of textures. In Section 4.4 we show that our method can detect different textures in a high-resolution image without labeling. It means that our model's complexity depends mostly on the number of unique textures in the dataset. The number of parameters of our model (the generator and the encoder) is

(12)   N_ours(d) = c'_0 + c'_1 · d,

where d is the size of the latent vector in our model, which consists of three parts of dimensions d_g, d_l, and d_p. In experiments, we show that the dimension d = 26 (d_g = 2, d_l = 20, d_p = 4) is sufficient to learn 116 unique textures.

For example, let us consider a dataset of size 5000 which contains 100 unique textures with 50 variations of each. Then for our model d will be 26, whereas DTS needs a one-hot input of size 5000, so its parameter count grows accordingly. We see that in this case our model's memory consumption is approximately 20 times smaller than that of DTS.

Figure 7: Merrigum House photo and its learned 2D texture manifold using our model
Permeability | Euler characteristic | Surface area
Ketton: 5.06 ± 0.35, 4.68 ± 0.56 | 3.66 ± 0.73, 1.86 ± 0.42 | 1.85 ± 0.62, 7.73 ± 0.18
Berea: 0.49 ± 0.07, 0.50 ± 0.12 | 0.34 ± 0.08, 1.36 ± 0.25 | 0.33 ± 0.11, 5.91 ± 0.54
Doddington: 0.42 ± 0.10, 3.41 ± 1.68 | 2.65 ± 2.29, 3.35 ± 1.13 | 4.83 ± 2.06, 7.92 ± 0.27
Estaillades: 0.80 ± 0.24, 3.41 ± 0.46 | 1.85 ± 0.29, 2.05 ± 1.05 | 4.62 ± 0.66, 6.93 ± 0.39
Bentheimer: 0.47 ± 0.08, 1.38 ± 0.49 | 1.24 ± 0.41, 3.44 ± 1.91 | 1.20 ± 0.73, 1.25 ± 0.12
Table 3: KL divergences between the real distribution of each statistic (permeability, Euler characteristic, and surface area) and the corresponding distributions from our model and from the baseline (value ± standard deviation). The standard deviation was computed using the bootstrap method.

4.7 Application to 3D Porous Media Synthesis

In this section, we demonstrate the applicability of our model to Digital Rock Physics. We trained our model on 3D porous media structures (see, e.g., Fig. 8(a); all samples were taken from a publicly available online source) of five different types: Ketton, Berea, Doddington, Estaillades, and Bentheimer. Each type of rock is given as a large binary voxel volume. As the baseline, we considered Porous Media GANs [27], which are deep convolutional GANs with 3D convolutional layers.

(a) Real
(b) Ours
(c) Baseline
Figure 8: Real, synthetic (our model), and synthetic (baseline model) Berea samples of the same size
Figure 9: A larger synthetic Berea sample generated with our model

For the comparison of our model with real samples and the baseline samples, we use permeability statistics and two so-called Minkowski functionals [20]. The permeability is a measure of the ability of a porous material to allow fluids to pass through it. Minkowski functionals describe the morphology and topology of 3D binary structures. In our experiments, we used two functionals: the surface area and the Euler characteristic. If the considered measures on synthetic samples are close to those on real ones, this indicates that the synthetic samples are valid for Digital Rock Physics applications.

We used the following experimental setup. We trained our model on random crops of a fixed size from all types of porous structures. We also trained five baseline models, one for each type. Then we generated synthetic samples of each type using our model and the baseline models. We also cropped samples of the same size from the real data. As a result, for each type of structure we obtained three sets of objects: real, synthetic (ours), and baseline.

The visual result of the synthesis for Berea is presented in Fig. 8. The figure shows three samples: a real one (i.e., cropped from the original large sample), ours, and a sample from the baseline model. Other types of porous materials, along with architecture details, are presented in Appendix E. Because our model is fully convolutional, we can increase the generated sample size by expanding the spatial dimensions of the latent embedding. We demonstrate a larger synthesized 3D porous media sample in Figure 9. Then,

  1. For each of the real, synthetic, and baseline objects, we calculated three statistics: permeability, surface area, and Euler characteristic.

  2. To measure the distance between the distributions of these statistics for the real, our, and baseline samples, we approximated the distributions by discrete ones obtained with the histogram method.

  3. Then, for each statistic, we calculated the KL divergence between the distributions of the statistic for a) real and our generated samples, and b) real and baseline generated samples (a minimal sketch of steps 2–3 is given after this list).
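A small sketch of steps 2 and 3, under our own choice of bin count and smoothing: each statistic's distribution is approximated by a histogram over a shared range, and the discrete KL divergence is computed against the real samples.

```python
import numpy as np

def histogram_kl(real_values, model_values, bins=20, eps=1e-9):
    """KL(real || model) between histogram approximations of a scalar statistic
    (e.g., permeability) computed on real and generated 3D samples."""
    lo = min(real_values.min(), model_values.min())
    hi = max(real_values.max(), model_values.max())
    p, _ = np.histogram(real_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(model_values, bins=bins, range=(lo, hi))
    p = (p + eps) / (p + eps).sum()      # smoothed, normalized histograms
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

Repeating this computation on bootstrap resamples of the sample sets gives the standard deviations reported in Table 3.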

The comparison of the KL divergences for the permeability and the Minkowski functionals is presented in Table 3. As we can see, our model performs better for most types of porous structures.

In this section, we showed the application of our model to Digital Rock Physics. Our model outperforms the baseline in most of the cases, which demonstrates its usefulness for solving real-world problems. Moreover, its critical advantage is the ability to generate multiple textures with the same model.

5 Conclusion

In this paper, we proposed a novel model for multi-texture synthesis. We showed that it ensures full dataset coverage and can detect textures in images in an unsupervised setting. We provided a way to learn a manifold of training textures even from a collection of raw high-resolution photos. We also demonstrated that the proposed model applies to a real-world 3D texture synthesis problem: porous media generation. Our model outperforms the baseline by better reproducing the physical properties of real data. In future work, we want to study the texture detection ability of our model and seek new applications for it.

Acknowledgements.

Aibek Alanov, Max Kochurov, Dmitry Vetrov were supported by Samsung Research, Samsung Electronics. The work of E. Burnaev and D. Volkhonskiy was supported by The Ministry of Education and Science of Russian Federation, grant No.14.615.21.0004, grant code: RFMEFI61518X0004. The authors E. Burnaev and D. Volkhonskiy acknowledge the usage of the Skoltech CDISE HPC cluster Zhores for obtaining some results presented in this paper.

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [2] U. Bergmann, N. Jetchev, and R. Vollgraf. Learning texture manifolds with the periodic spatial gan. ICML, 2017.
  • [3] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. ICLR, 2017.
  • [4] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, pages 3606–3613, Washington, DC, USA, 2014. IEEE Computer Society.
  • [5] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. ICLR, 2017.
  • [6] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. Proc. of ICLR, 2017.
  • [7] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 341–346. ACM, 2001.
  • [8] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In iccv, page 1033. IEEE, 1999.
  • [9] O. Frigo, N. Sabater, J. Delon, and P. Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 553–561, 2016.
  • [10] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
  • [11] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman. Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897, 2016.
  • [12] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [14] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–238. ACM, 1995.
  • [15] N. Jetchev, U. Bergmann, and R. Vollgraf. Texture synthesis with spatial generative adversarial networks. arXiv preprint arXiv:1611.08207, 2016.
  • [16] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
  • [17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [18] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.
  • [19] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: image and video synthesis using graph cuts. ACM Transactions on Graphics (ToG), 22(3):277–286, 2003.
  • [20] D. Legland, K. Kiêu, and M.-F. Devaux. Computation of minkowski measures on 2d and 3d binary images. Image Analysis & Stereology, 26(2):83–92, 2011.
  • [21] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pages 5495–5503, 2017.
  • [22] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
  • [23] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
  • [24] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In Proc. CVPR, 2017.
  • [25] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. ICLR, 2016.
  • [26] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. ICLR, 2018.
  • [27] L. Mosser, O. Dubrule, and M. J. Blunt. Reconstruction of three-dimensional porous media using generative adversarial neural networks. Physical Review E, 96(4):043309, 2017.
  • [28] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision, 40(1):49–70, 2000.
  • [29] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [30] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
  • [31] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
  • [32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826, 2016.
  • [33] H. Thanh-Tung, T. Tran, and S. Venkatesh. Improving generalization and stability of generative adversarial networks. arXiv preprint arXiv:1902.03984, 2019.
  • [34] M. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational bayes for non-conjugate inference. In International Conference on Machine Learning, pages 1971–1979, 2014.
  • [35] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pages 1349–1357, 2016.
  • [36] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • [37] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6924–6932, 2017.
  • [38] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. It takes (only) two: Adversarial generator-encoder networks. In AAAI. AAAI Press, 2018.
  • [39] D. Volkhonskiy, E. Muravleva, O. Sudakov, D. Orlov, B. Belozerov, E. Burnaev, and D. Koroteev. Reconstruction of 3d porous media from 2d slices. arXiv preprint arXiv:1901.10233, 2019.
  • [40] L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 479–488. ACM Press/Addison-Wesley Publishing Co., 2000.
  • [41] W. Xian, P. Sangkloy, V. Agrawal, A. Raj, J. Lu, C. Fang, F. Yu, and J. Hays. Texturegan: Controlling deep image synthesis with texture patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2018.
  • [42] Y. Zhou, Z. Zhu, X. Bai, D. Lischinski, D. Cohen-Or, and H. Huang. Non-stationary texture synthesis by adversarial expansion. arXiv preprint arXiv:1805.04487, 2018.
  • [43] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

Appendix

Appendix A Additional Experiments

A.1 Results on Datasets: Scaly, Braided, Honeycomb, Striped

In this appendix section, we provide results for all datasets mentioned in the main text as attached files. The listing below references these files.

  1. Scaly Dataset

    • scaly_ours_2D_samples.jpg – a grid of sample images for our model with d_g = 2

    • scaly_ours_2D_recon.jpg – reconstructions for every texture present in the training dataset for our model with d_g = 2

    • scaly_ours_40D_samples.jpg – a grid of sample images for our model with d_g = 40

    • scaly_ours_40D_recon.jpg – reconstructions for every texture present in the training dataset for our model with d_g = 40

    • movie-2d-plane-ours.gif – a visualization of the training process of the 2D latent space for our model

    • movie-2d-plane-psgan.gif – a visualization of the training process of the 2D latent space for the PSGAN model

  2. Braided Dataset

    • braided_ours_2D_samples.jpg – a grid of sample images for our model with d_g = 2

    • braided_ours_2D_recon.jpg – reconstructions for every texture present in the training dataset for our model with d_g = 2

    • braided_ours_40D_samples.jpg – a grid of sample images for our model with d_g = 40

    • braided_ours_40D_recon.jpg – reconstructions for every texture present in the training dataset for our model with d_g = 40

  3. Honeycomb Dataset

    • honeycomb_ours_2D_samples.jpg – a grid of sample images for our model with d_g = 2

    • honeycomb_ours_2D_recon.jpg – reconstructions for every texture present in the training dataset for our model with d_g = 2

    • honeycomb_ours_40D_samples.jpg – a grid of sample images for our model with d_g = 40

    • honeycomb_ours_40D_recon.jpg – reconstructions for every texture present in the training dataset for our model with d_g = 40

  4. Striped Dataset

    • striped_ours_2D_samples.jpg – a grid of sample images for our model with d_g = 2

    • striped_ours_2D_recon.jpg – reconstructions for every texture present in the training dataset for our model with d_g = 2

A.2 Results on a Collection of Raw Images

In this part, we train our model on 5 high-resolution images (see Figure 10) in a fully unsupervised way. We show that our method learns a descriptive manifold of textures from these images (see Figure 11), which we can use for texture detection. We demonstrate that we can apply this technique to unseen images (see Figure 12).

Figure 10: Training images
Figure 11: Learned manifold from training images
(a) Initial image
(b) Grass texture detected
(c) Initial image
(d) Mountain texture detected
(e) Initial image
(f) Field texture detected
Figure 12: Examples of texture detection by our model for unseen images.

Appendix B Algorithm Description

The main algorithm of the training pipeline is presented in Algorithm 2 and visualized in Figure 2. Additionally, it is possible to add a regularization term for the embeddings that increases the variance of the encoding distribution. Added with a small weight, this term does not worsen the results and slightly improves training stability. The improvement is marginal and optional, and for this reason we omitted it in the main text. We also did not use this loss term to train the models presented in the main text.

repeat
      Draw a minibatch of textures x_1, …, x_n from the dataset
      For each x_i, draw another example x̃_i of the same texture
      Draw noise samples z_1, …, z_n from the prior p(z)
      Sample embeddings ẑ_i from the encoder distribution E(z|x_i) for each x_i
      Sample textures G(z_i) from the prior samples
      Sample reproductions G(ẑ_i) for each real texture x_i
      Compute the loss of the discriminator D_1
      Compute the loss of the discriminator D_2
      Compute the loss of the discriminator D_3
      Compute the loss of the generator G
      Compute the loss of the encoder E
      Take a gradient step on the discriminator networks
      Take a gradient step on the generator and encoder networks
until convergence
Algorithm 2: The training algorithm of the proposed method.
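A condensed PyTorch-style sketch of one iteration of Algorithm 2 (the data handling, the encoder output convention, and the equal weighting of loss terms are our assumptions); the loss helpers refer to the sketches given in Section 2.1.

```python
import torch
import torch.nn.functional as F

def d2_loss(D2, real_pair, fake_pair):
    """Pair-discriminator loss: real same-texture pairs vs. (texture, reproduction) pairs."""
    r = D2(*real_pair)
    f = D2(fake_pair[0], fake_pair[1].detach())
    return (F.binary_cross_entropy_with_logits(r, torch.ones_like(r)) +
            F.binary_cross_entropy_with_logits(f, torch.zeros_like(f)))

def training_step(G, E, D1, D2, D3, x, x_tilde, z, opt_d, opt_ge):
    """One iteration of Algorithm 2. x, x_tilde: batches of same-texture pairs;
    z: a batch sampled from the prior. The helpers d1_loss, g_matching_loss,
    pair_matching_g_e_loss, d3_loss, encoder_matching_loss are the Section 2.1 sketches."""
    mu, logvar = E(x)
    z_hat = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample from E(z|x)
    fake, rec = G(z), G(z_hat)                                    # prior samples and reproductions

    # Update the three discriminators on detached samples.
    opt_d.zero_grad()
    loss_d = (d1_loss(D1, x, fake) + d1_loss(D1, x, rec)          # generator matching
              + d2_loss(D2, (x, x_tilde), (x, rec))               # pair matching
              + d3_loss(D3, z, z_hat))                            # encoder matching
    loss_d.backward()
    opt_d.step()

    # Update the generator and the encoder with non-saturating losses.
    opt_ge.zero_grad()
    loss_ge = (g_matching_loss(D1, fake) + g_matching_loss(D1, rec)
               + pair_matching_g_e_loss(D2, x, rec)
               + encoder_matching_loss(D3, z_hat))
    loss_ge.backward()
    opt_ge.step()
    return loss_d.item(), loss_ge.item()
```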

Appendix C Network Architectures

C.1 Encoder Network Architecture

The architecture of the encoder is similar to the discriminator architecture. The first difference is that batch norm layers are added. The second one is that convolutional layers are followed by global average pooling to obtain a single embedding for an input texture. See Table 4.

Layer Output size Parameters
Input
Conv2d

kernel=5, stride=2, pad=2

LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2, bias=False
BatchNorm2d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2, bias=False
BatchNorm2d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2, bias=False
BatchNorm2d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2
AdaptiveAvgPool2d output_size=1
Table 4: Architecture description for the encoder
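Read as code, Table 4 corresponds roughly to the following module (the channel widths and the latent dimension are our assumptions, since the output-size column was lost; the mean/log-variance split follows the encoding distribution described in Section 4).

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Fully convolutional encoder in the spirit of Table 4: five stride-2 convolutions
    with batch norm, followed by global average pooling. It outputs 2 * d_g channels,
    interpreted as the mean and log-variance of the encoding distribution."""

    def __init__(self, in_ch=3, base=64, d_g=2):
        super().__init__()
        ch = [base, base * 2, base * 4, base * 8]
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch[0], 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(ch[0], ch[1], 5, stride=2, padding=2, bias=False),
            nn.BatchNorm2d(ch[1]), nn.LeakyReLU(0.2),
            nn.Conv2d(ch[1], ch[2], 5, stride=2, padding=2, bias=False),
            nn.BatchNorm2d(ch[2]), nn.LeakyReLU(0.2),
            nn.Conv2d(ch[2], ch[3], 5, stride=2, padding=2, bias=False),
            nn.BatchNorm2d(ch[3]), nn.LeakyReLU(0.2),
            nn.Conv2d(ch[3], 2 * d_g, 5, stride=2, padding=2),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        out = self.net(x).flatten(1)           # (batch, 2 * d_g)
        mu, logvar = out.chunk(2, dim=1)       # mean and log-variance of E(z|x)
        return mu, logvar
```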

C.2 Generator Network Architecture

The architecture for the generator is taken from the PSGAN model without any changes. See Table 5.

Details on the spatially modified generator.

As mentioned in Section 4.5, we need to modify the generator to obtain the "reconstructed" picture in Figure 1. This is done by modifying only the Compute Period Coefs part of the generator, replacing the Linear layer with a Conv1x1 layer with the same weight matrix. Previously, the period coefficients were shared along the spatial dimensions and depended only on one global code z^g. Once we apply the spatial encoder (obtained by replacing global pooling with spatial pooling) to an image, we obtain "global" codes that vary along the spatial dimensions. Conv1x1 allows efficiently computing the periodic coefficients for every spatial position; note that the random phase offset is manually set to zero. The resulting global tensor is stacked with the local and periodic parts to form the input noise tensor, which is passed to the Upsampling part of the generator. As the generator is fully convolutional, we are free to choose the input spatial size.
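A small sketch of the described replacement: the weights of a Linear layer that maps one global code to period coefficients are reused as a 1x1 convolution, so that the same mapping can be applied to a spatially varying grid of codes (the layer sizes below are placeholders).

```python
import torch
import torch.nn as nn

def linear_to_conv1x1(linear: nn.Linear) -> nn.Conv2d:
    """Reuse the weights of a Linear layer as a 1x1 convolution, so the same mapping
    can be applied independently at every spatial position of a code tensor."""
    conv = nn.Conv2d(linear.in_features, linear.out_features, kernel_size=1)
    with torch.no_grad():
        conv.weight.copy_(linear.weight.view(linear.out_features, linear.in_features, 1, 1))
        conv.bias.copy_(linear.bias)
    return conv

# Usage: `period_linear` maps z^g (d_g features) to period coefficients; after conversion
# it can be applied to a (batch, d_g, h, w) tensor of spatially varying global codes.
period_linear = nn.Linear(2, 8)
period_conv = linear_to_conv1x1(period_linear)
codes = torch.randn(1, 2, 16, 16)
coeffs = period_conv(codes)          # (1, 8, 16, 16): per-position period coefficients
```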

Layer Output size Parameters
Upsampling part
Input
ConvTranspose2d kernel=5, stride=2, pad=2, output_pad=1, bias=False
BatchNorm2d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.2
ConvTranspose2d kernel=5, stride=2, pad=2, output_pad=1, bias=False
BatchNorm2d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.2
ConvTranspose2d kernel=5, stride=2, pad=2, output_pad=1, bias=False
BatchNorm2d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.2
ConvTranspose2d kernel=5, stride=2, pad=2, output_pad=1, bias=False
BatchNorm2d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.2
ConvTranspose2d kernel=5, stride=2, pad=2, output_pad=1, bias=False
Tanh
Compute Period Coefs
Input
Linear
ReLU
Linear
Table 5: Architecture description for the generator

C.3 Architectures of Discriminator Networks

C.3.1 Texture Discriminator

The architecture of the texture discriminator is taken from the PSGAN model with spectral normalization added. Spectral normalization improves the stability of training. See Table 6.

Layer Output size Parameters
Input
Conv2d kernel=5, stride=2, pad=2 + spectral_norm
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2 + spectral_norm
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2 + spectral_norm
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2 + spectral_norm
Table 6: Architecture description for the texture discriminator

C.3.2 Pair Discriminator

In the proposed architecture for the pair discriminator, the Convolutional part is the same as for the texture discriminator except for the number of channels in the last layer. The outputs for the two images are combined into a similarity matrix, and the Conv 1x1 part is applied to this matrix to obtain spatial predictions for each image. This architecture is symmetric with respect to the input order and can work with image pairs of different sizes (we did not require this feature in our algorithm). See Table 7.

Layer Output size Parameters
Convolutional part
Input
Conv2d kernel=5, stride=2, pad=2 + spectral_norm
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2 + spectral_norm
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2 + spectral_norm
LeakyReLU slope=0.2
Conv2d kernel=5, stride=2, pad=2 + spectral_norm
Conv 1x1 part
Input
Conv2d kernel=1, stride=1 + spectral_norm
LeakyReLU slope=0.2
Conv2d kernel=1, stride=1 + spectral_norm
Table 7: Architecture description for the pair discriminator

C.3.3 Latent Discriminator

Following recent works [25], we are motivated to use an adversarial training scheme for the latent representations. Another benefit of using an additional discriminator is that it keeps the loss terms at the same scale. See Table 8.

Layer Output size Parameters
Input
Linear
LeakyReLU slope=0.2
Linear
LeakyReLU slope=0.2
Linear
Table 8: Architecture description for the latent discriminator

Appendix D Hyperparameters

We used the set of hyperparameters provided in Table 9 and Table 10 to train the model on 116 textures from the scaly dataset.

Hyperparameter Value
crop size from image (160, 160)
batch size 64
spectral normalization for discriminators True
number of steps for discriminator per 1 step of generator 1
iterations 100000
latent prior standard normal
d_g 2
d_l 20
d_p 4
Table 9: General hyperparameters of the model
Hyperparameter Value
initialization for weights
optimizer adam
adam betas 0.5, 0.999
learning rate 0.0002
weight decay 0.0001
Table 10: Network-specific hyperparameters for the generator, the encoder, and the discriminator networks

Appendix E 3D Porous Media Synthesis

In this section, we describe network architectures, hyperparameters and experiments for the 3D porous media generation.

E.1 Network Architectures

Architectures for 3D porous media synthesis have almost the same structure as for 2D textures. The main differences are the following:

  1. instead of Conv2D (TransposedConv2D) layers we used Conv3D (TransposedConv3D) layers;

  2. we do not use the periodic latent component, since there is no need for periodicity in porous structures.

To compare our model with the baseline fairly, we used the same generator and discriminator networks in both our model and the baseline.

E.1.1 3D Encoder Network Architecture

The architecture of the 3D encoder is presented in Table 11.

Layer Output size Parameters
Input
Conv3d kernel=4, stride=1, pad=0
LeakyReLU slope=0.01
Conv3d kernel=4, stride=2, pad=1, bias=False
BatchNorm3d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.01
Conv3d kernel=4, stride=2, pad=1, bias=False
BatchNorm3d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.01
Conv3d kernel=4, stride=2, pad=1, bias=False
BatchNorm3d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.01
Conv3d kernel=4, stride=2, pad=1, bias=False
BatchNorm3d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.01
Conv3d kernel=1, stride=1, pad=0
AdaptiveAvgPool3d output_size=1
Table 11: Architecture description for the 3D encoder

E.1.2 3D Generator Network Architecture

The architecture of the 3D generator is presented in Table 12. The same generator architecture was used in the baseline.

Layer Output size Parameters
Upsampling part
Input
ConvTranspose3d kernel=4, stride=1, pad=0, output_pad=1, bias=False
BatchNorm3d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.01
ConvTranspose3d kernel=4, stride=2, pad=1, output_pad=1, bias=False
BatchNorm3d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.01
ConvTranspose3d kernel=4, stride=2, pad=1, output_pad=1, bias=False
BatchNorm3d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.01
ConvTranspose3d kernel=4, stride=2, pad=1, output_pad=1, bias=False
BatchNorm3d eps=1e-05, momentum=1.0, affine=True
LeakyReLU slope=0.01
ConvTranspose3d kernel=4, stride=2, pad=1, output_pad=1, bias=False
Tanh
Table 12: Architecture description for the 3D generator

E.1.3 Architecture of the 3D Discriminator Network

The architecture of the 3D texture discriminator is presented in Table 13. The same discriminator architecture was used in the baseline.

Layer Output size Parameters
Input
Conv3d kernel=4, stride=2, pad=1 + spectral_norm
LeakyReLU slope=0.01
Conv3d kernel=4, stride=2, pad=1 + spectral_norm
LeakyReLU slope=0.01
Conv3d kernel=4, stride=2, pad=1
LeakyReLU