1 Introduction
Textures are essential perceptual elements in computer graphics. They can be defined as images with repetitive or periodic local patterns. Texture synthesis models based on deep neural networks have recently drawn great interest from the computer vision community. Gatys et al. [10, 12] proposed to use a convolutional neural network as an effective texture feature extractor: the Gram matrices of hidden-layer activations of a pretrained VGG network serve as a descriptor of a texture. Follow-up papers [16, 35, 23] significantly speed up texture synthesis by replacing the expensive optimization process of [10, 12] with a fast forward pass of a feedforward convolutional network. However, these methods suffer from problems such as lack of generality (i.e., one network must be trained per texture) and poor diversity (i.e., the synthesized textures are visually indistinguishable).

Property  PSGAN [2]  DTS [24]  Ours
multitexture  ✓  ✓  ✓
user control  ✗  ✓  ✓
dataset coverage  ✗  ✓  ✓
scalability with respect to dataset size  ✓  ✗  ✓
ability to learn textures from raw data  ✓  ✗  ✓
unsupervised texture detection  ✗  ✗  ✓
applicability to 3D  ✓  ✗  ✓
Recently, the Periodic Spatial GAN (PSGAN) [2] and Diversified Texture Synthesis (DTS) [24] models were proposed as attempts to partly solve these issues. PSGAN and DTS are multitexture synthesis models, i.e., they train one network to generate many textures. However, each model has its own limitations (see Table 1). PSGAN has incomplete dataset coverage and lacks a user control mechanism. Incomplete dataset coverage means that the model can miss some textures from the training dataset. The absence of user control means that one cannot explicitly specify which texture PSGAN should generate. DTS is not scalable with respect to dataset size, cannot learn textures from raw data, and cannot synthesize 3D textures. It is not scalable because the number of parameters of the DTS model depends linearly on the number of textures in the dataset. Learning from raw data means that the input to the model is a high-resolution image, as in Figure 1, and the method must extract textures in an unsupervised way. DTS does not support this training mode (which we call fully unsupervised) because its input textures must be specified explicitly. DTS does not generalize to 3D textures because the VGG network is not applicable to 3D images.
We propose a novel multitexture synthesis model that does not have the limitations of the PSGAN and DTS methods. Our model can generate a user-specified texture from the training dataset. This is achieved with an encoder network that learns a latent representation for each texture in the dataset. To ensure complete dataset coverage, we use a loss function that penalizes incorrect reproductions of a given texture. The generator is thus forced to learn to synthesize each texture seen during the training phase. Our method is more scalable with respect to dataset size than DTS and is able to learn textures in a fully unsupervised way from raw data, such as a collection of high-resolution photos. We show that our model learns a descriptive texture manifold in latent space. Such low-dimensional representations can serve as useful texture descriptors, for example, for unsupervised texture detection (see Figure 1). We can also apply our approach to 3D texture synthesis because we use fully adversarial losses and do not rely on VGG network descriptors.
We experimentally show that our model can learn large datasets of textures. We check that our generator learns all textures from the training dataset by conditionally synthesizing each of them. We demonstrate that our model can learn meaningful texture manifolds as opposed to PSGAN (see Figure 6). We compare the efficiency of our approach and DTS in terms of memory consumption and show that our model is much more scalable than DTS for large datasets.
We apply our method to 3D texture-like porous media structures, a real-world problem from Digital Rock Physics. Synthesis of porous structures plays an important role [39] because assessing the variability in the inherent material properties is often not feasible experimentally. Moreover, it is usually necessary to acquire a number of representative samples of the void-solid structure. We show that in porous media synthesis our method outperforms a baseline [27] that trains one network per texture.
To summarize, our model has the following key advantages:

user control (conditional generation),

full dataset coverage,

scalability with respect to dataset size,

ability to learn descriptive texture manifolds from raw data in a fully unsupervised way,

applicability to 3D texture synthesis.
2 Proposed Method
We look for a multitexture synthesis pipeline that generates textures in a user-controllable manner, ensures full dataset coverage, and scales with dataset size. We use an encoder network that maps textures to a latent space and yields low-dimensional representations. Our generator network is similar to the one in PSGAN.
The generator takes as input a noise tensor that has three parts. These parts are the same as in PSGAN:

a global part, which determines the type of texture. It consists of a single vector that is repeated along the spatial dimensions.

a local part, each element of which is sampled independently from a standard normal distribution. This part encourages diversity within one texture.

a periodic part, computed from trainable functions of the global vector and a random offset that is sampled independently. This part helps to generate periodic patterns.
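The three-part structure of the noise tensor can be illustrated with a small sketch. It assumes the PSGAN-style sinusoidal form for the periodic part; the function name `build_noise_tensor`, the fixed linear maps standing in for the trainable functions, and all sizes are hypothetical and chosen only for illustration.

```python
import math
import random

def build_noise_tensor(z_global, height, width, d_local, d_periodic, rng):
    """Assemble an H x W grid of noise vectors [global | local | periodic]."""
    # Hypothetical stand-ins for the trainable functions: fixed linear maps
    # from the global vector to one pair of wave numbers per periodic channel.
    def wave_numbers(k):
        a = sum((k + 1) * 0.1 * v for v in z_global)
        b = sum((k + 1) * -0.05 * v for v in z_global)
        return a, b

    # One random phase offset per periodic channel, shared by all positions.
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(d_periodic)]

    grid = []
    for i in range(height):
        row = []
        for j in range(width):
            # Local part: independent standard normal noise at every position.
            local = [rng.gauss(0.0, 1.0) for _ in range(d_local)]
            # Periodic part: sinusoid whose wave numbers depend on the global vector.
            periodic = []
            for k in range(d_periodic):
                a, b = wave_numbers(k)
                periodic.append(math.sin(a * i + b * j + phases[k]))
            # Global part: the same vector repeated at every spatial position.
            row.append(list(z_global) + local + periodic)
        grid.append(row)
    return grid

rng = random.Random(0)
z_g = [rng.gauss(0.0, 1.0) for _ in range(2)]
noise = build_noise_tensor(z_g, height=4, width=5, d_local=3, d_periodic=2, rng=rng)
```

Only the global vector has to be supplied by the caller; the other two parts are derived or sampled inside the function.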
To generate a texture it is therefore sufficient to feed the generator only the global vector, because the local part is sampled independently of it and the periodic part is computed from it. We can thus treat the global vector as a latent representation of the corresponding texture, and we train our encoder to recover this latent vector from an input texture. In what follows, we assume that the generator takes only the global vector as input and builds the other parts of the noise tensor from it. For simplicity, we refer to this global vector as the latent vector.
The encoder takes an input texture and returns a distribution over the latent vector of that texture.
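We can now formulate the properties that we expect from the generator and the encoder in our model:

generated samples are real textures if the latent vector is sampled from the prior distribution.

if the latent vector is sampled from the encoder distribution of a given texture, then the generated image has the same texture type as that texture.

the aggregated distribution of the encoder, i.e., the encoder distribution averaged over the true distribution of textures, should be close to the prior distribution.

generated samples are real textures if the latent vector is sampled from the aggregated encoder distribution.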
To ensure these properties we use three types of adversarial losses:

generator matching: for matching the distribution of both samples and reproductions to the distribution of real textures.

pair matching: for matching the distribution of pairs consisting of a texture and its reproduction to the distribution of pairs consisting of two samples of the same texture. It ensures that a reproduction has the same texture type as its input texture.

encoder matching: for matching the aggregated encoder distribution to the prior distribution.
We give exact definitions of these adversarial losses in Section 2.1. The whole training pipeline is shown in Figure 2 and in Appendix B.
2.1 Generator & Encoder Objectives
Generator Matching. To match both samples and reproductions to real textures, we use a discriminator as in PSGAN, which maps an input image to a two-dimensional tensor of spatial outputs. Each element of the discriminator's output corresponds to a local part of the input image and estimates the probability that this receptive field is real rather than synthesized by the generator. The value function of this adversarial game is
(1)
As in [13], we modify the value function for the generator by replacing the term $\log(1 - D(\cdot))$ with $-\log D(\cdot)$, where $D(\cdot)$ denotes the discriminator's output on a generated image. The resulting adversarial loss is
(2)
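The effect of the modification from [13] can be checked numerically. The sketch below is an illustration rather than the training code: it treats the discriminator output as a sigmoid of a single logit (an assumption) and compares the gradients of the two generator objectives when the discriminator confidently rejects a generated sample. The helper names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def saturating_loss(logit):
    # Original minimax term minimized by the generator: log(1 - D(G(z))).
    return math.log(1.0 - sigmoid(logit))

def non_saturating_loss(logit):
    # Modified term from [13]: -log D(G(z)).
    return -math.log(sigmoid(logit))

def numerical_grad(fn, x, h=1e-5):
    # Central finite difference, used here instead of an autodiff library.
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

# A strongly negative logit means the discriminator is sure the sample is fake.
logit = -6.0
grad_saturating = numerical_grad(saturating_loss, logit)
grad_non_saturating = numerical_grad(non_saturating_loss, logit)
```

Both gradients point in the same direction, but the saturating one is close to zero exactly when the generator most needs a learning signal.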
Pair Matching. The goal is to match fake pairs, consisting of a texture and its reproduction, to real pairs, consisting of two samples of the same texture (in practice, we obtain real pairs by taking two different random patches from one texture). For this purpose we use a discriminator with a special architecture, shown in Figure 3. The discriminator takes two input images and convolves them separately with the same convolutional layers. After obtaining an embedded tensor for each input image, we reshape each tensor into a matrix in which each row is the embedding of the corresponding receptive field of the input image. We then compute element-wise products for every pair of rows from the two matrices, which gives a three-dimensional tensor. We convolve it with two convolutional layers using 1×1 kernels and obtain a two-dimensional matrix. The element in the i-th row and j-th column of this matrix represents the mutual similarity between the i-th receptive field of the first image and the j-th receptive field of the second image. We then average this matrix row-wise (for the first image) and column-wise (for the second image). This yields two vectors, which we reshape into matrices that follow the spatial layout of the receptive fields. To simplify the notation, we concatenate these matrices into one matrix, apply an element-wise sigmoid, and output the result as a matrix of discriminator probabilities, as in PSGAN.
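The similarity computation inside the pair discriminator can be sketched in plain Python. This is a simplified illustration under stated assumptions: the two 1×1 convolutional layers are replaced by a single weighted sum over channels (`channel_weights`, hypothetical), and the embeddings are given directly as lists of per-receptive-field vectors.

```python
import math

def pair_discriminator_scores(emb_a, emb_b, channel_weights):
    """emb_a: N_a x C and emb_b: N_b x C embeddings of receptive fields."""
    # Pairwise element-wise products, reduced over channels by a linear map
    # (a stand-in for the two 1x1 convolutions in the described architecture).
    similarity = [
        [sum(w * a * b for w, a, b in zip(channel_weights, row_a, row_b))
         for row_b in emb_b]
        for row_a in emb_a
    ]
    # Row-wise average: one score per receptive field of the first image.
    row_means = [sum(row) / len(row) for row in similarity]
    # Column-wise average: one score per receptive field of the second image.
    col_means = [
        sum(row[j] for row in similarity) / len(similarity)
        for j in range(len(emb_b))
    ]
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    return [sigmoid(v) for v in row_means], [sigmoid(v) for v in col_means]

emb_first = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]   # 3 receptive fields
emb_second = [[1.0, 0.4], [-0.7, 0.9]]               # 2 receptive fields
weights = [0.6, -0.2]
probs_first, probs_second = pair_discriminator_scores(emb_first, emb_second, weights)
swapped_first, swapped_second = pair_discriminator_scores(emb_second, emb_first, weights)
```

Swapping the inputs swaps the two outputs, which mirrors the symmetry with respect to input order, and the two images may have different numbers of receptive fields.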
We consider the following distributions:

the distribution over real pairs, in which both elements are examples of the same texture;

the distribution over fake pairs, in which the first element is a real texture and the second is its reproduction, i.e., the generator's output for a latent vector sampled from the encoder distribution of that texture.
The value function for this adversarial game, defined over the elements of the discriminator's output matrix, is
(3)
The discriminator tries to maximize the value function while the generator and the encoder minimize it.
The adversarial loss is then
(4)
To compute gradients with respect to the encoder parameters, we use the reparameterization trick [18, 30, 34].
Encoder Matching. Encoder matching is needed because, if the encoder were trained with only one objective, the embeddings of textures could lie very far from samples drawn from the prior distribution. This would make the training of the generator unstable, because the generator has to produce good images both for samples from the prior and for embeddings produced by the encoder.
Therefore, to regularize the encoder, we match the prior distribution and the aggregated encoder distribution using a separate discriminator, which classifies samples from the prior versus samples from the aggregated encoder distribution. The minimax game between the encoder and this discriminator is defined by the value function
(5)
To sample from the aggregated encoder distribution, we first sample a texture and then sample a latent vector from the encoder distribution of that texture. The adversarial loss is
(6)
As for the pair matching loss, we compute the gradients of this loss with respect to the encoder parameters using the reparameterization trick.
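A minimal sketch of the reparameterization trick follows. It assumes, for illustration only, that the encoder distribution is a diagonal Gaussian described by a mean vector and a standard deviation vector; the function names are hypothetical.

```python
import random

def reparameterize(mu, sigma, eps):
    # z = mu + sigma * eps is a deterministic, differentiable function of
    # (mu, sigma); all randomness is isolated in eps ~ N(0, I).
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

def sample_latent(mu, sigma, rng):
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return reparameterize(mu, sigma, eps)

rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 0.1]
samples = [sample_latent(mu, sigma, rng) for _ in range(20000)]
mean_first = sum(s[0] for s in samples) / len(samples)
```

Because the noise is drawn independently of the encoder outputs, gradients of a loss on the sampled latent vector can flow back into the mean and the standard deviation.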
Final Objectives. Thus, for both the generator and the encoder we optimize the following objectives:

the generator loss
(7) 
the encoder loss
(8)
In experiments, we use .
3 Related Work
Traditional texture synthesis models can broadly be divided into two categories: non-parametric and parametric. Non-parametric methods [7, 8, 19, 40] synthesize new textures by repeating and resampling local patches from a given example. Such techniques can produce large textures, but they require heavy computation and can be slow. Parametric approaches [14, 28] build an explicit model of textures by introducing statistical measures. To generate a new texture, one runs an optimization process that matches the statistics of the synthesized image to those of a given texture. The method of [28] shows good results in generating different textures. The main limitations of this approach are its high time complexity and the need to define handcrafted statistics for matching textures.
Deep learning methods were shown to be efficient parametric models for texture synthesis. The papers of Gatys et al. [10, 12] are a milestone: they proposed to use Gram matrices of VGG intermediate-layer activations as texture descriptors. This approach generates high-quality texture images [10] by running an expensive optimization process. Subsequent works [35, 16, 23] significantly accelerate texture synthesis by approximating this optimization procedure with fast feedforward convolutional networks. Further works improve this approach by using optimization techniques [9, 11, 22], introducing instance normalization [37, 36], or applying GAN-based models to non-stationary texture synthesis [42]. These methods have significant limitations, such as the requirement to train one network per texture and poor diversity of samples.
Multitexture synthesis methods. DTS [24] was introduced by Li et al. as a multitexture synthesis model. It consists of one feedforward convolutional network that takes a one-hot vector corresponding to a specific texture together with a noise vector, passes them through convolutional layers, and generates an image. This architecture makes DTS non-scalable to large datasets because the number of model parameters depends linearly on the dataset size. It cannot learn from raw data in a fully unsupervised way because its input textures must be specified explicitly by one-hot vectors. This method is also not applicable to 3D textures because it relies on VGG Gram matrix descriptors, which are suitable only for 2D images.
The Spatial GAN (SGAN) model [15] was introduced by Jetchev et al. as the first method in which GANs [13] are applied to texture synthesis. It showed good results on certain textures, surpassing the results of [10]. Bergmann et al. [2] improved SGAN by introducing the Periodic Spatial GAN (PSGAN) model. It can learn multiple textures because its input noise has a hierarchical structure. Since PSGAN optimizes only the vanilla GAN loss, it does not ensure full dataset coverage. This failure is also known as mode collapse and is considered a common problem in GAN models [1, 29, 33]. In addition, this method does not allow conditional generation of textures, i.e., we cannot explicitly specify the texture that the model should generate.
Our model is based on GANs with an encoder network that maps an input texture to a latent embedding. There are many ways to train autoencoding GANs [31, 38, 3, 6, 5, 21, 43]. The main part of such models is the objective responsible for accurate reproduction of a given image. Standard choices are $L_1$ and $L_2$ norms [31, 38, 43] or perceptual distances [3]. For textures, the VGG Gram matrix-based loss is more common [10, 35, 16]. We use an adversarial loss for this purpose, inspired by [41], where it is used for image synthesis guided by sketch, color, and texture. The benefit of such a loss is that it can be easily applied to 3D textures. Previous works [27, 39] on synthesizing 3D porous material used GAN-based methods with 3D convolutional layers in the generator and the discriminator. However, they trained a separate model for each texture. We show that our model can learn multiple 3D textures with conditional generation.
4 Experiments
In experiments, we train our model on the scaly, braided, honeycomb and striped categories from the Oxford Describable Textures Dataset [4]. These are datasets of natural textures in the wild. We use the same fully convolutional architectures for the generator and the discriminator as in PSGAN [2]. We used spectral normalization [26] for the discriminators, which significantly improved training stability. For the encoder we used an architecture similar to that of the discriminator. The global dimension was found to be a sensitive parameter, and we choose it separately for different models. The encoder network outputs a tensor that is reduced by global average pooling to obtain the parameters of the encoding distribution. As in the PSGAN model, we fix the dimensions of the local and periodic parts. For the pair discriminator we used the architecture described in Figure 3. A complete reference for the network architectures can be found in Appendix C.
4.1 Inception Score for Textures
It is common practice in natural image generation to evaluate a model that approximates the data distribution using the Inception Score [32]. For this purpose, the Inception network is used to obtain the label distribution $p(y \mid x)$ for each generated image $x$. Then one calculates
(9) $\mathrm{IS} = \exp\left(\mathbb{E}_{x}\, \mathrm{KL}\left(p(y \mid x) \,\|\, p(y)\right)\right)$
where $p(y)$ is the aggregated label distribution. A straightforward application of the Inception network does not make sense for textures. Therefore, we train our own classifier, whose architecture mirrors a network from our model (the only difference is the number of output logits), to predict texture types for a given texture dataset. To do that properly, we manually clean our data of duplicates so that every texture example has a distinct label, and we use random cropping as data augmentation. Our trained classifier achieves 100% accuracy on the scaly dataset. We use this classifier to evaluate the Inception Score for models trained on the same texture dataset.

Model  Uncond. IS  Cond. IS
PSGAN-5D  73.68 ± 0.6  NA
Our-2D  73.74 ± 0.3  103.96 ± 0.1
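The score can be computed directly from predicted label distributions. The sketch below implements the standard Inception Score definition from [32] on toy inputs; the function name `inception_score` and the example distributions are hypothetical.

```python
import math

def inception_score(label_dists):
    """label_dists: one predicted label distribution p(y|x) per sample."""
    n = len(label_dists)
    k = len(label_dists[0])
    # Aggregated label distribution p(y): average of p(y|x) over samples.
    marginal = [sum(p[c] for p in label_dists) / n for c in range(k)]
    # Mean KL divergence between each p(y|x) and p(y).
    kl_sum = 0.0
    for p in label_dists:
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0.0)
    return math.exp(kl_sum / n)

# Confident predictions that cover all four classes equally.
covering = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Uninformative predictions: every sample looks the same to the classifier.
collapsed = [[0.25, 0.25, 0.25, 0.25]] * 4
score_covering = inception_score(covering)
score_collapsed = inception_score(collapsed)
```

The score is bounded above by the number of classes, so with 116 unique textures a perfect model would reach 116; this gives a scale for reading the values in Table 2.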
4.2 Unconditional and Conditional Generation
For models like PSGAN we cannot obtain reproductions; we only have access to the texture generation process. One may ask for guarantees that a model is able to generate every texture in the dataset from the prior distribution alone. We evaluate PSGAN and our model on the scaly dataset with 116 unique textures. After the models are trained, we estimate the Inception Score. We observed that the Inception Score varies with the global dimension, and thus picked the best global dimension separately for PSGAN and our model, obtaining 5 and 2, respectively. Both models were trained with Adam [17] (betas=0.5,0.99) with batch size 64 on a single GPU. Their best performance was achieved after around 80k iterations. For both models, we used spectral normalization to improve training stability [26].
Both models can generate high-quality textures from a low-dimensional space. Our model can additionally generate reproductions of every texture in the dataset. Figure 5 and Table 2 summarise the results for conditional (reproductions) and unconditional texture generation. The figure indicates that PSGAN may miss textures, whereas our model does not suffer from this issue. The Inception Score suggests that conditional generation is a far better way to sample from the model. In Figure 4 we provide samples and texture reproductions for the trained models. A larger set of samples and reproductions for every texture can be found in Appendix A.1, along with evaluations on the braided, honeycomb and striped categories from the Oxford Describable Textures Dataset.
4.3 Texture Manifold
The autoencoding property is a nice-to-have feature for generative models. One can treat embeddings as low-dimensional data representations. As shown in Section 4.2, our model can reconstruct every texture in the dataset. Moreover, we are able to visualize the manifold of textures, since we trained this model with a two-dimensional latent space. To compare this manifold to PSGAN, we train a separate PSGAN model with a two-dimensional global part. 2D manifolds near the prior distribution for both models can be found in Figure 6. Our model learns a visually better 2D manifold and places similar textures close to each other. Visualizations of the manifolds during training (for different epochs) can be found in Appendix A.1.
4.4 Learning Texture Manifolds from Raw Data
The manifold in Section 4.3 was learned from well-prepared data. Real cases usually do not have clean data and require either expensive data preparation or unsupervised methods. With minor corrections to the data preparation pipeline, our model can learn texture manifolds from raw data such as a collection of high-resolution photos. To train texture manifolds on raw data, we suggest constructing the real pairs in equation 3 from two crops taken at almost the same location, using the stochastic procedure described in Algorithm 1. In Figure 7 we show a manifold learned from the House photo.
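The idea of drawing two crops from almost the same location can be sketched as follows. This is not Algorithm 1 itself, whose details are not reproduced here; the uniform jitter, the `max_shift` parameter and the function name are assumptions made for illustration.

```python
import random

def sample_crop_pair(height, width, crop, max_shift, rng):
    """Return top-left corners of two nearby crops inside an image."""
    # First crop: uniformly random location that fits inside the image.
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    # Second crop: the first location jittered by at most max_shift pixels,
    # clamped so that the crop stays inside the image.
    top2 = min(max(top + rng.randint(-max_shift, max_shift), 0), height - crop)
    left2 = min(max(left + rng.randint(-max_shift, max_shift), 0), width - crop)
    return (top, left), (top2, left2)

rng = random.Random(0)
pairs = [sample_crop_pair(1024, 768, crop=160, max_shift=32, rng=rng)
         for _ in range(100)]
```

Nearby crops in a photo are likely to show the same texture, so such pairs can play the role of real pairs without any labels.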
4.5 Spatial Embeddings and Texture Detection
As described in Sections 4.3 and 4.4, our method can learn a descriptive texture manifold from a collection of raw data in an unsupervised way. The resulting texture embeddings can be useful beyond generation. Consider a large input image, such as the first one in Figure 1, and a generator and an encoder trained on this image. Note that at the training stage the encoder is a fully convolutional network followed by global average pooling. Applied to the whole image as is, the encoder would output an "average" texture embedding for the entire image. Replacing global average pooling with spatial average pooling with a small kernel makes the encoder output a texture embedding for each receptive field of the input image. We refer to this network as the modified encoder.
Its output is a tensor of spatial texture embeddings for the input image. They change smoothly along the spatial dimensions, as visualized by reconstructing them with the modified generator (described in Appendix C) in the third picture in Figure 1.
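The difference between global and spatial average pooling can be seen on a toy feature map. The sketch below uses plain Python lists instead of a deep learning framework; the kernel size, stride and feature values are hypothetical.

```python
def global_average_pool(feature_map):
    """Average an H x W x C feature map into a single C-vector."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]

def spatial_average_pool(feature_map, kernel, stride):
    """Average over small windows, keeping one C-vector per window."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    pooled = []
    for top in range(0, h - kernel + 1, stride):
        row = []
        for left in range(0, w - kernel + 1, stride):
            window = [feature_map[i][j]
                      for i in range(top, top + kernel)
                      for j in range(left, left + kernel)]
            row.append([sum(v[k] for v in window) / len(window) for k in range(c)])
        pooled.append(row)
    return pooled

# Toy 4 x 4 map with one channel: left half is "texture 0", right half "texture 1".
features = [[[0.0], [0.0], [1.0], [1.0]] for _ in range(4)]
global_embedding = global_average_pool(features)
spatial_embeddings = spatial_average_pool(features, kernel=2, stride=2)
```

Global pooling blends the two regions into one value, while spatial pooling keeps a separate embedding for each region.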
One can take a reference patch with a texture (e.g., grass) and find similar textures in the image. This is illustrated in the last picture in Figure 1. We picked a patch with grass on it and constructed a heatmap
(10)
based on the Euclidean distance between the embedding of the reference patch and the spatial embeddings of the image. We then interpolated the heatmap to the original size of the image.
This example shows that the modified encoder allows the learned embeddings to be used for tasks unrelated to texture generation. We believe supervised methods would benefit from additional features obtained in an unsupervised way.
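As a rough illustration of such a heatmap, the sketch below computes the raw Euclidean distance between a reference embedding and every spatial embedding. How the distance is mapped to the final heat values is not reproduced here, so the raw distance map is an assumption, as are the function name and the toy embeddings.

```python
import math

def embedding_distance_map(spatial_embeddings, reference):
    """Euclidean distance from each spatial embedding to a reference embedding."""
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, reference))) for vec in row]
        for row in spatial_embeddings
    ]

# Toy 2 x 2 grid of 2-dimensional texture embeddings.
grid = [[[0.0, 1.0], [3.0, 5.0]],
        [[0.0, 1.0], [0.0, 2.0]]]
reference_patch_embedding = [0.0, 1.0]
distances = embedding_distance_map(grid, reference_patch_embedding)
```

Small distances mark regions whose texture resembles the reference patch; upsampling the map to the image size then gives a per-pixel detection map.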
4.6 Memory Complexity
In this section, we compare the scalability of DTS and our model with respect to dataset size, measured by the number of model parameters as a function of the dataset size. The number of parameters of the DTS model (we use the official implementation on GitHub, file "Model_multi_texture_synthesis.lua") is
(11) 
Note that the DTS parameter count depends on the size of the whole dataset, while the number of unique textures in the dataset can be much smaller. Therefore, the method does not scale to large datasets with duplicates. To reduce memory complexity, DTS requires labeling, which allows the method to find unique textures and to set the size of the one-hot vector to the number of different texture types. Our model learns textures in an unsupervised way and uses a low-dimensional representation of textures instead of a one-hot vector. In Section 4.4 we show that our method can detect different textures in a high-resolution image without labeling. This means that the complexity of our model depends mostly on the number of unique textures in the dataset. The number of parameters of our model (the generator and the encoder) is
(12) 
where the count depends on the size of the latent vector in our model, which consists of three parts (global, local and periodic). In experiments, we show that the latent dimension we use is sufficient to learn 116 unique textures.
For example, consider a dataset of size 5000 that contains 100 unique textures with 50 variations of each. Then the latent dimension of our model will be 26, while the DTS parameter count grows with the full dataset size. In this case, the memory consumption of our model is approximately 20 times lower than that of DTS.
Rock type  Permeability: ours  Permeability: baseline  Euler characteristic: ours  Euler characteristic: baseline  Surface area: ours  Surface area: baseline
Ketton  5.06 ± 0.35  4.68 ± 0.56  3.66 ± 0.73  1.86 ± 0.42  1.85 ± 0.62  7.73 ± 0.18
Berea  0.49 ± 0.07  0.50 ± 0.12  0.34 ± 0.08  1.36 ± 0.25  0.33 ± 0.11  5.91 ± 0.54
Doddington  0.42 ± 0.10  3.41 ± 1.68  2.65 ± 2.29  3.35 ± 1.13  4.83 ± 2.06  7.92 ± 0.27
Estaillades  0.80 ± 0.24  3.41 ± 0.46  1.85 ± 0.29  2.05 ± 1.05  4.62 ± 0.66  6.93 ± 0.39
Bentheimer  0.47 ± 0.08  1.38 ± 0.49  1.24 ± 0.41  3.44 ± 1.91  1.20 ± 0.73  1.25 ± 0.12
Table 3: KL divergences for permeability and for the two Minkowski functionals. The standard deviation was computed using the bootstrap method.
4.7 Application to 3D Porous Media Synthesis
In this section, we demonstrate the applicability of our model to Digital Rock Physics. We trained our model on 3D porous media structures of five different types (see, e.g., Fig. 7(a)): Ketton, Berea, Doddington, Estaillades and Bentheimer. All samples were taken from an external website. Each rock sample is a volume of binary voxels. As the baseline, we considered Porous Media GANs [27], which are deep convolutional GANs with 3D convolutional layers.
To compare our model with real samples and with the baseline samples, we use permeability statistics and two so-called Minkowski functionals [20]. Permeability measures the ability of a porous material to allow fluids to pass through it. Minkowski functionals describe the morphology and topology of 3D binary structures. In our experiments, we used two functionals: surface area and Euler characteristic. If these measures on synthetic samples are close to those on real ones, this indicates that the synthetic samples are valid for Digital Rock Physics applications.
We used the following experimental setup. We trained our model on random crops from all types of porous structures. We also trained five baseline models, one on each type separately. Then we generated synthetic samples of each type using our model and the baseline model. We also cropped samples of the same size from the real data. As a result, for each type of structure, we obtained three sets of objects: real, synthetic and baseline.
The visual result of the synthesis is presented in Fig. 8 for Berea. The figure shows three samples: a real one (i.e., cropped from the original large sample), one from our model, and one from the baseline model. Other types of porous materials, along with architecture details, are presented in Appendix E. Because our model is fully convolutional, we can increase the generated sample size by expanding the spatial dimensions of the latent embedding. We show larger synthesized 3D porous media in Figure 9. The evaluation then proceeds as follows:

For each real, synthetic and baseline object we calculated three statistics: permeability, surface area and Euler characteristic.

To measure the distance between the distributions of statistics for real, our and baseline samples, we approximated these distributions by discrete ones obtained using the histogram method with a fixed number of bins.

Then for each statistic, we calculated the KL divergence between the distributions of the statistic for a) real and our generated samples; b) real and baseline generated samples.
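The last two steps can be sketched as follows. The shared bin range, the number of bins and the additive smoothing that keeps empty bins from producing infinite divergence are assumptions of this sketch, not details taken from the experiments.

```python
import math

def histogram(values, low, high, bins):
    """Count values in equal-width bins over [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        # Clamp so that v == high falls into the last bin.
        idx = min(int((v - low) / width), bins - 1)
        counts[idx] += 1
    return counts

def kl_divergence(sample_p, sample_q, bins=10, smoothing=1e-6):
    """KL(P || Q) between histogram approximations of two samples."""
    low = min(min(sample_p), min(sample_q))
    high = max(max(sample_p), max(sample_q))
    if high == low:
        return 0.0
    counts_p = histogram(sample_p, low, high, bins)
    counts_q = histogram(sample_q, low, high, bins)
    # Additive smoothing (an assumption) avoids division by zero in empty bins.
    p = [(c + smoothing) / (len(sample_p) + bins * smoothing) for c in counts_p]
    q = [(c + smoothing) / (len(sample_q) + bins * smoothing) for c in counts_q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

real_stat = [0.10, 0.12, 0.15, 0.11, 0.13, 0.14, 0.12, 0.16]
close_stat = [0.11, 0.12, 0.14, 0.10, 0.13, 0.15, 0.12, 0.16]
far_stat = [0.30, 0.32, 0.35, 0.31, 0.33, 0.34, 0.32, 0.36]
kl_same = kl_divergence(real_stat, real_stat)
kl_close = kl_divergence(real_stat, close_stat)
kl_far = kl_divergence(real_stat, far_stat)
```

Using the same bin edges for both samples matters: the divergence compares probability mass bin by bin, so mismatched edges would make the values meaningless.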
The KL divergences for permeability and for the Minkowski functionals are compared in Tab. 3. Our model performs better than the baseline for most types of porous structures.
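The standard deviations in Tab. 3 come from the bootstrap method. A generic sketch of a bootstrap standard deviation is given below; it resamples a single sample and uses the mean as the statistic, whereas in the experiments the statistic of interest is the KL divergence. The number of resamples and the function names are hypothetical.

```python
import random
import statistics

def bootstrap_std(sample, statistic, n_resamples, rng):
    """Standard deviation of a statistic over bootstrap resamples."""
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.pstdev(estimates)

rng = random.Random(0)
measurements = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
std_of_mean = bootstrap_std(measurements, statistics.mean, n_resamples=500, rng=rng)
std_constant = bootstrap_std([3.0] * 10, statistics.mean, n_resamples=50, rng=rng)
```

The spread of the statistic across resamples serves as an estimate of its sampling uncertainty without any distributional assumptions.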
In this section, we showed the application of our model to Digital Rock Physics. Our model outperforms the baseline in most cases, which supports its usefulness for real-world problems. Moreover, its critical advantage is the ability to generate multiple textures with the same model.
5 Conclusion
In this paper, we proposed a novel model for multitexture synthesis. We showed that it ensures full dataset coverage and can detect textures in images in the unsupervised setting. We provided a way to learn a manifold of training textures even from a collection of raw high-resolution photos. We also demonstrated that the proposed model applies to a real-world 3D texture synthesis problem: porous media generation. Our model outperforms the baseline by better reproducing the physical properties of real data. In future work, we want to study the texture detection ability of our model and seek new applications for it.
Acknowledgements.
Aibek Alanov, Max Kochurov, Dmitry Vetrov were supported by Samsung Research, Samsung Electronics. The work of E. Burnaev and D. Volkhonskiy was supported by The Ministry of Education and Science of Russian Federation, grant No.14.615.21.0004, grant code: RFMEFI61518X0004. The authors E. Burnaev and D. Volkhonskiy acknowledge the usage of the Skoltech CDISE HPC cluster Zhores for obtaining some results presented in this paper.
Appendix
Appendix A Additional Experiments
A.1 Results on Datasets: Scaly, Braided, Honeycomb, Striped
In this appendix section, we provide results for all datasets mentioned in the main text as attached files. The files are listed below.

Scaly Dataset

scaly_ours_2D_samples.jpg – grid of sample images for our model with global dimension 2

scaly_ours_2D_recon.jpg – reconstructions for every texture present in the training dataset for our model with global dimension 2

scaly_ours_40D_samples.jpg – grid of sample images for our model with global dimension 40

scaly_ours_40D_recon.jpg – reconstructions for every texture present in the training dataset for our model with global dimension 40

movie2dplaneours.gif – a visualization of the training process of the 2D latent space for our model

movie2dplanepsgan.gif – a visualization of the training process of the 2D latent space for the PSGAN model


Braided Dataset

braided_ours_2D_samples.jpg – grid of sample images for our model with global dimension 2

braided_ours_2D_recon.jpg – reconstructions for every texture present in the training dataset for our model with global dimension 2

braided_ours_40D_samples.jpg – grid of sample images for our model with global dimension 40

braided_ours_40D_recon.jpg – reconstructions for every texture present in the training dataset for our model with global dimension 40


Honeycomb Dataset

honeycomb_ours_2D_samples.jpg – grid of sample images for our model with global dimension 2

honeycomb_ours_2D_recon.jpg – reconstructions for every texture present in the training dataset for our model with global dimension 2

honeycomb_ours_40D_samples.jpg – grid of sample images for our model with global dimension 40

honeycomb_ours_40D_recon.jpg – reconstructions for every texture present in the training dataset for our model with global dimension 40


Striped Dataset

striped_ours_2D_samples.jpg – grid of sample images for our model with global dimension 2

striped_ours_2D_recon.jpg – reconstructions for every texture present in the training dataset for our model with global dimension 2

A.2 Results on a Collection of Raw Images
In this part, we train our model on 5 high-resolution images (see Figure 10) in a fully unsupervised way. We show that our method learns a descriptive manifold of textures from these images (see Figure 11), which we can use for texture detection. We demonstrate that this technique can also be applied to unseen images (see Figure 12).
Appendix B Algorithm Description
The main algorithm for the training pipeline is presented in Algorithm 2 and visualised in Figure 2. Additionally, it is possible to add a regularization term for the embeddings that increases the variance of the embedding distribution. With a suitably chosen weight, this term does not worsen the results and slightly improves training stability. The improvement is marginal and the term is optional, so we omitted it from the main text. We also did not use this loss term to train the models presented in the main text.
Appendix C Network Architectures
C.1 Encoder Network Architecture
The architecture of the encoder is similar to the discriminator architecture. The first difference is that batch norm layers are added. The second one is that convolutional layers are followed by global average pooling to obtain a single embedding for an input texture. See Table 4.
Layer  Output size  Parameters

Input
Conv2d
LeakyReLU  slope=0.2
Conv2d  kernel=5, stride=2, pad=2, bias=False
BatchNorm2d  eps=1e-05, momentum=1.0, affine=True
LeakyReLU  slope=0.2
Conv2d  kernel=5, stride=2, pad=2, bias=False
BatchNorm2d  eps=1e-05, momentum=1.0, affine=True
LeakyReLU  slope=0.2
Conv2d  kernel=5, stride=2, pad=2, bias=False
BatchNorm2d  eps=1e-05, momentum=1.0, affine=True
LeakyReLU  slope=0.2
Conv2d  kernel=5, stride=2, pad=2
AdaptiveAvgPool2d  output_size=1
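The spatial sizes produced by this stack of strided convolutions can be traced with the standard convolution output-size formula. The sketch assumes that all five convolutions share kernel=5, stride=2, pad=2 and uses a hypothetical input size of 160 pixels.

```python
def conv2d_output_size(size, kernel=5, stride=2, pad=2):
    # Standard output-size formula for a convolution with dilation 1.
    return (size + 2 * pad - kernel) // stride + 1

def trace_sizes(input_size, num_convs):
    sizes = [input_size]
    for _ in range(num_convs):
        sizes.append(conv2d_output_size(sizes[-1]))
    return sizes

# Hypothetical 160-pixel input passed through the five Conv2d layers.
encoder_sizes = trace_sizes(160, num_convs=5)
```

Each layer halves the spatial size, so the feature map before global average pooling is 32 times smaller than the input along each spatial dimension.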
C.2 Generator Network Architecture
The architecture for the generator is taken from the PSGAN model without any changes. See Table 5.
Details on the modified generator
As mentioned in Section 4.5, we need to modify the generator to obtain the "reconstructed" picture in Figure 1. This is done by modifying only the Compute Period Coefs part of the generator: each Linear layer is replaced with a Conv1x1 layer that has the same weight matrix. Previously, the period coefficients were shared along the spatial dimensions and depended on only one global code. Once we apply the modified encoder (with global pooling replaced by spatial pooling) to an image, we obtain "global" codes that vary along the spatial dimensions. Conv1x1 makes it possible to efficiently compute the periodic coefficients for every spatial position and thus to obtain the periodic part of the noise tensor. Note that the random offset is manually set to zero. The global tensor is then stacked with the local and periodic tensors to form the input that is passed to the Upsampling part of the generator. As the generator is fully convolutional, the input spatial size can be arbitrary.
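The equivalence behind replacing a Linear layer with a Conv1x1 layer can be shown in a few lines: a 1×1 convolution applies the same linear map independently at every spatial position. The weights and codes below are hypothetical toy values.

```python
def linear(weight, bias, vec):
    """Apply y = W x + b to a single vector."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def conv1x1(weight, bias, feature_map):
    """A 1x1 convolution: the same linear map at every spatial position."""
    return [[linear(weight, bias, vec) for vec in row] for row in feature_map]

weight = [[0.5, -1.0], [2.0, 0.25], [0.0, 1.0]]   # maps 2 channels to 3
bias = [0.1, 0.0, -0.2]

# One global code repeated over a 2 x 3 grid: every position gives the same output.
global_code = [1.0, 2.0]
shared_codes = [[global_code for _ in range(3)] for _ in range(2)]
shared_out = conv1x1(weight, bias, shared_codes)

# Spatially varying codes: each position gets its own coefficients.
varying_codes = [[[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]],
                 [[1.0, 1.0], [3.0, 2.0], [0.5, 0.5]]]
varying_out = conv1x1(weight, bias, varying_codes)
```

With a shared code the result coincides with the original Linear layer, which is why the trained weights can be reused without retraining.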
Layer  Output size  Parameters 
Upsampling part  
Input  
ConvTranspose2d  kernel=5, stride=2, pad=2, output_pad=1, bias=False  
BatchNorm2d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.2  
ConvTranspose2d  kernel=5, stride=2, pad=2, output_pad=1, bias=False  
BatchNorm2d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.2  
ConvTranspose2d  kernel=5, stride=2, pad=2, output_pad=1, bias=False  
BatchNorm2d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.2  
ConvTranspose2d  kernel=5, stride=2, pad=2, output_pad=1, bias=False  
BatchNorm2d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.2  
ConvTranspose2d  kernel=5, stride=2, pad=2, output_pad=1, bias=False  
Tanh  
Compute Period Coefs  
Input  
Linear  
ReLU  
Linear 
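Since the output sizes are not legible in the table above, the following sketch recomputes the spatial size after each transposed convolution using the standard output-size formula (dilation 1). The input size of 5 is an assumption, chosen because it leads to the 160-pixel crop size listed in Appendix D.

```python
def conv_transpose_out(n, kernel, stride, pad, output_pad):
    """Output length of a transposed convolution along one spatial axis (dilation 1)."""
    return (n - 1) * stride - 2 * pad + kernel + output_pad


def upsampling_sizes(n, num_layers=5):
    """Spatial sizes through the Upsampling part, using the parameters from the table."""
    sizes = [n]
    for _ in range(num_layers):
        n = conv_transpose_out(n, kernel=5, stride=2, pad=2, output_pad=1)
        sizes.append(n)
    return sizes


print(upsampling_sizes(5))  # [5, 10, 20, 40, 80, 160]
```

With these parameters every layer exactly doubles the spatial size, so the five layers together upsample by a factor of 32.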
c.3 Architectures of Discriminator Networks
c.3.1 Discriminator
The discriminator architecture is taken from the PSGAN model, with spectral normalization added. Spectral normalization improves the stability of training. See Table 6.
Layer  Output size  Parameters 

Input  
Conv2d  kernel=5, stride=2, pad=2 + spectral_norm  
LeakyReLU  slope=0.2  
Conv2d  kernel=5, stride=2, pad=2 + spectral_norm  
LeakyReLU  slope=0.2  
Conv2d  kernel=5, stride=2, pad=2  
LeakyReLU  slope=0.2  
Conv2d  kernel=5, stride=2, pad=2 + spectral_norm  
LeakyReLU  slope=0.2  
Conv2d  kernel=5, stride=2, pad=2 + spectral_norm 
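As a rough illustration of what spectral normalization does to a layer, the sketch below estimates the largest singular value of a weight matrix by power iteration and divides the weights by it. This is a generic sketch, not the authors' code: the 2 x 2 matrix is a toy example, convolution kernels would first be reshaped into a 2D matrix, and practical implementations typically run a single iteration per training step with a persistent vector rather than 50 iterations from scratch.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spectral_norm(weight, iterations=50):
    """Estimate the largest singular value of `weight` by power iteration."""
    u = normalize([1.0] * len(weight))
    for _ in range(iterations):
        v = normalize(matvec(transpose(weight), u))
        u = normalize(matvec(weight, v))
    # sigma = u^T W v
    return sum(ui * wi for ui, wi in zip(u, matvec(weight, v)))


def spectrally_normalize(weight, iterations=50):
    """Rescale `weight` so that its largest singular value is approximately 1."""
    sigma = spectral_norm(weight, iterations)
    return [[w / sigma for w in row] for row in weight]


W = [[3.0, 0.0], [0.0, 1.0]]   # toy matrix with singular values 3 and 1
sigma = spectral_norm(W)
W_sn = spectrally_normalize(W)
print(round(sigma, 6))         # 3.0
```

After normalization the layer is approximately 1-Lipschitz with respect to the L2 norm, which is the property usually credited for the stabilizing effect.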
c.3.2 Discriminator
In the proposed architecture for the discriminator on pairs, the Convolutional part is the same as in the discriminator of Section C.3.1, except for the number of channels in the last layer. The outputs for the two images form a matrix, and the Conv 1x1 part is applied to this matrix to obtain spatial predictions for each image. This architecture is symmetric with respect to the input order and can work with image pairs of different sizes (we did not need this feature in our algorithm). See Table 7.
Layer  Output size  Parameters 
Convolutional part  
Input  
Conv2d  kernel=5, stride=2, pad=2 + spectral_norm  
LeakyReLU  slope=0.2  
Conv2d  kernel=5, stride=2, pad=2 + spectral_norm  
LeakyReLU  slope=0.2  
Conv2d  kernel=5, stride=2, pad=2  
LeakyReLU  slope=0.2  
Conv2d  kernel=5, stride=2, pad=2 + spectral_norm  
LeakyReLU  slope=0.2  
Conv2d  kernel=5, stride=2, pad=2 + spectral_norm  
Conv 1x1 part  
Input  
Conv2d  kernel=1, stride=1 + spectral_norm  
LeakyReLU  slope=0.2  
Conv2d  kernel=1, stride=1 + spectral_norm 
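The spatial sizes produced by the Convolutional part can be recomputed from the listed kernel, stride and padding with the standard convolution output-size formula (dilation 1). The input size of 160 is taken from the crop size in Appendix D; treating it as the input of this discriminator is an assumption.

```python
def conv_out(n, kernel, stride, pad):
    """Output length of a convolution along one spatial axis (dilation 1)."""
    return (n + 2 * pad - kernel) // stride + 1


def conv_part_sizes(n, num_layers=5):
    """Spatial sizes through the five stride-2 convolutions of the Convolutional part."""
    sizes = [n]
    for _ in range(num_layers):
        n = conv_out(n, kernel=5, stride=2, pad=2)
        sizes.append(n)
    return sizes


print(conv_part_sizes(160))                    # [160, 80, 40, 20, 10, 5]
print(conv_out(5, kernel=1, stride=1, pad=0))  # 5
```

The second call shows that the kernel=1, stride=1 layers of the Conv 1x1 part leave the spatial size unchanged.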
c.3.3 Discriminator
Following recent work [25], we use an adversarial training scheme for latent representations. Another benefit of using an additional discriminator is that it keeps the loss terms on the same scale. See Table 8.
Layer  Output size  Parameters 

Input  
Linear  
LeakyReLU  slope=0.2  
Linear  
LeakyReLU  slope=0.2  
Linear 
Appendix D Hyperparameters
The hyperparameters used to train the model on 116 textures from the scaly category are provided in Table 9 and Table 10.
Hyperparameter  Value 

crop size from image  (160, 160) 
batch size  64 
spectral normalization for discriminators  True 
number of steps for discriminator per 1 step of generator  1 
iterations  100000 
latent prior  
2  
20  
4 
Hyperparameter  Value 

initialization for weights  
optimizer  Adam 
adam betas  0.5, 0.999 
learning rate  0.0002 
weight decay  0.0001 
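To make the optimizer settings concrete, here is a minimal scalar sketch of one Adam update with the values from the table. Two points are assumptions rather than facts from the paper: weight decay is applied as an L2 term added to the gradient (as in many common Adam implementations), and `EPS = 1e-8` is a conventional default that the table does not list.

```python
import math

# Values from the hyperparameter table.
LR = 0.0002
BETA1, BETA2 = 0.5, 0.999
WEIGHT_DECAY = 0.0001
EPS = 1e-8  # assumption: not listed in the table


def adam_step(theta, grad, m, v, t):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    grad = grad + WEIGHT_DECAY * theta         # L2-style weight decay (assumed)
    m = BETA1 * m + (1 - BETA1) * grad         # first-moment estimate
    v = BETA2 * v + (1 - BETA2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)               # bias correction
    v_hat = v / (1 - BETA2 ** t)
    theta = theta - LR * m_hat / (math.sqrt(v_hat) + EPS)
    return theta, m, v


theta, m, v = adam_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(round(theta, 6))  # 0.9998
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate.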
Appendix E 3D Porous Media Synthesis
In this section, we describe the network architectures, hyperparameters and experiments for 3D porous media generation.
e.1 Network Architectures
Architectures for 3D porous media synthesis have almost the same structure as for 2D textures. The main differences are the following:

instead of Conv2D (TransposedConv2D) layers we used Conv3D (TransposedConv3D) layers;

we do not use the periodic latent component, since porous structures do not need periodicity.
For a fair comparison with the baseline, we used the same generator and discriminator networks in both our model and the baseline.
e.1.1 3D Encoder Network Architecture
The architecture of the 3D encoder is presented in Table 11.
Layer  Output size  Parameters 

Input  
Conv3d  kernel=4, stride=1, pad=0  
LeakyReLU  slope=0.01  
Conv3d  kernel=4, stride=2, pad=1, bias=False  
BatchNorm3d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.01  
Conv3d  kernel=4, stride=2, pad=1, bias=False  
BatchNorm3d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.01  
Conv3d  kernel=4, stride=2, pad=1, bias=False  
BatchNorm3d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.01  
Conv3d  kernel=4, stride=2, pad=1, bias=False  
BatchNorm3d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.01  
Conv3d  kernel=1, stride=1, pad=0  
AdaptiveAvgPool3d  output_size=1 
e.1.2 3D Generator Network Architecture
The architecture of the 3D generator is presented in Table 12. The same generator architecture was used in the baseline.
Layer  Output size  Parameters 
Upsampling part  
Input  
ConvTranspose3d  kernel=4, stride=1, pad=0, output_pad=1, bias=False  
BatchNorm3d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.01  
ConvTranspose3d  kernel=4, stride=2, pad=1, output_pad=1, bias=False  
BatchNorm3d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.01  
ConvTranspose3d  kernel=4, stride=2, pad=1, output_pad=1, bias=False  
BatchNorm3d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.01  
ConvTranspose3d  kernel=4, stride=2, pad=1, output_pad=1, bias=False  
BatchNorm3d  eps=1e-05, momentum=1.0, affine=True  
LeakyReLU  slope=0.01  
ConvTranspose3d  kernel=4, stride=2, pad=1, output_pad=1, bias=False  
Tanh 
e.1.3 Architectures of the 3D Discriminator Network
The architecture of the 3D texture discriminator is presented in Table 13. The same discriminator architecture was used in the baseline.
Layer  Output size  Parameters 

Input  
Conv3d  kernel=4, stride=2, pad=1 + spectral_norm  
LeakyReLU  slope=0.01  
Conv3d  kernel=4, stride=2, pad=1 + spectral_norm  
LeakyReLU  slope=0.01  
Conv3d  kernel=4, stride=2, pad=1  
LeakyReLU 
