The goal of exemplar-based texture synthesis is to generate texture images that are visually similar to a given exemplar. Recently, promising results have been reported by methods relying on convolutional neural networks (ConvNets) pretrained on large-scale image datasets. However, these methods have difficulty synthesizing image textures with non-local structures and extending to dynamic or sound textures. In this paper, we present a conditional generative ConvNet (cgCNN) model which combines deep statistics with the probabilistic framework of the generative ConvNet (gCNN) model. Given a texture exemplar, the cgCNN model defines a conditional distribution using deep statistics of a ConvNet, and synthesizes new textures by sampling from the conditional distribution. In contrast to previous deep texture models, the proposed cgCNN does not rely on pretrained ConvNets but instead learns the weights of the ConvNet for each input exemplar. As a result, the cgCNN model can synthesize high-quality dynamic, sound and image textures in a unified manner. We also explore the theoretical connections between our model and other texture models. Further investigations show that the cgCNN model can be easily generalized to texture expansion and inpainting. Extensive experiments demonstrate that our model achieves results that are better than, or at least comparable to, those of state-of-the-art methods.
Exemplar-based texture synthesis (EBTS) has been an active yet challenging topic in computer vision and graphics for the past decades [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. It aims to produce new samples that are visually similar to a given texture exemplar. The main difficulty of EBTS is to efficiently synthesize texture samples that are not only perceptually similar to the exemplar, but also balance the repeated and innovated elements in the texture.
To overcome this difficulty, two main categories of approaches have been proposed in the literature, i.e., patch-based methods [2, 3, 4, 11] and methods relying on parametric statistical models [1, 5, 6, 7, 12]. Given a texture exemplar, patch-based methods regard small patches in the exemplar as basic elements, and generate new samples by copying pixels or patches from the exemplar to the synthesized texture under certain spatial constraints, such as the Markovian property [2, 11, 4]. These methods can produce new textures with high visual fidelity to the given exemplar, but they often result in verbatim copies and, with few exceptions, cannot be extended to dynamic textures. Moreover, despite their promising performance, they offer little insight into the underlying process that generates textures. In contrast, statistical parametric methods concentrate on exploring the underlying model of the texture exemplar, and new texture images can then be synthesized by sampling from the learned texture model. These methods are better at balancing the repetitive and innovative nature of textures, but they usually fail to reproduce textures with highly structured elements. It is worth mentioning that a few of these methods can be extended to sound textures and dynamic ones. Some recent surveys on EBTS can be found in [16, 9].
Recently, parametric models have been revived by the use of deep neural networks [12, 17, 18, 19]. These models employ deep ConvNets that are pretrained on large-scale image datasets, instead of handcrafted filters, as feature extractors, and generate new samples by seeking images that maximize a certain similarity between their deep features and those of the exemplar. Although these methods show great improvements over traditional parametric models, there are still two unsolved or only partially solved problems: 1) It is difficult to extend these methods to other types of textures, such as dynamic and sound textures, since they rely on ConvNets pretrained on large-scale datasets such as ImageNet, and comparable pretrained networks are difficult to obtain in the video or sound domain. 2) These models cannot synthesize textures with non-local structures, as the optimization algorithm is likely to be trapped in local minima where non-local structures are not preserved. A common remedy is to use extra penalty terms, such as the Fourier spectrum or a correlation matrix, but these terms bring in extra hyper-parameters and are slow to optimize.
In order to address these problems, we propose a new texture model named conditional generative ConvNet (cgCNN) by integrating deep texture statistics and the probabilistic framework of the generative ConvNet (gCNN). Given a texture exemplar, cgCNN first defines an energy-based conditional distribution using deep statistics of a trainable ConvNet, which is then trained by maximum likelihood estimation (MLE). New textures can be synthesized by sampling from the learned conditional distribution. Unlike previous texture models that rely on pretrained ConvNets, cgCNN learns the weights of the ConvNet for each input exemplar. It therefore has two main advantages: 1) It can synthesize image, dynamic and sound textures in a unified manner. 2) It can synthesize textures with non-local structures without using extra penalty terms, as it is easier for the sampling algorithm to escape from local minima.
We further present two forms of our cgCNN model, i.e. the canonical cgCNN (c-cgCNN) and the forward cgCNN (f-cgCNN), by exploiting two different sampling strategies. We show that these two forms of cgCNN have strong theoretical connections with previous texture models. Specifically, c-cgCNN uses Langevin dynamics for sampling, and it can synthesize highly non-stationary textures, whereas f-cgCNN uses a fully convolutional generator network as an approximate fast sampler, and it can synthesize arbitrarily large stationary textures. We further show that Gatys’ method and TextureNet are special cases of c-cgCNN and f-cgCNN respectively. In addition, we derive a concise texture inpainting algorithm based on cgCNN, which iteratively searches for a template in the uncorrupted region and synthesizes a texture patch according to the template.
Our main contributions are thus summarized as follows:
We propose a new texture model named cgCNN which combines deep statistics and the probabilistic framework of gCNN model. Instead of relying on pretrained ConvNets as previous deep texture models, the proposed cgCNN learns the weights of the ConvNet adaptively for each input exemplar. As a result, cgCNN can synthesize high quality dynamic, sound and image textures in a unified manner.
We present two forms of cgCNN and show their effectiveness in texture synthesis and expansion: c-cgCNN can synthesize highly non-stationary textures without extra penalty terms, while f-cgCNN can synthesize arbitrarily large stationary textures. We also show their strong theoretical connections with previous texture models. Note f-cgCNN is the first deep texture model that enables us to expand dynamic or sound textures.
We present a simple but effective algorithm for texture inpainting based on the proposed cgCNN. To our knowledge, it is the first neural algorithm for inpainting sound textures.
We conduct extensive experiments on the synthesis, expansion and inpainting of various types of textures using cgCNN (all results can be found at captain.whu.edu.cn/cgcnn-texture). We demonstrate that our model achieves results that are better than, or at least comparable to, those of state-of-the-art methods.
The rest of this paper is organized as follows: Sec. II reviews some related works. Sec. III recalls four baseline models. Sec. IV details cgCNN’s formulation and training algorithm, and provides some theoretical analysis to the models. Sec. V uses cgCNN for the synthesis of various types of textures and adapts the synthesis algorithm to texture inpainting. Sec. VI presents results that demonstrate the effectiveness of cgCNN in synthesizing, expanding and inpainting all three types of textures. Sec. VII draws some conclusions.
One seminal work on parametric EBTS was made by Heeger and Bergen , who proposed to synthesize textures by matching the marginal distributions of the synthesized and the exemplar texture. Subsequently, Portilla et al.  extended this model by using more and higher-order measurements. Another remarkable work at that time was the FRAME model proposed by Zhu et al. , which is a framework unifying the random field model and the maximum entropy principle for texture modeling. Other notable works include [7, 8, 21]. These methods built solid theoretical background for texture synthesis, but are limited in their ability to synthesize structured textures.
Recently, Gatys made a breakthrough in texture modelling by using deep neural networks. This model can be seen as an extension of Portilla’s model, where the linear filters are replaced by a pretrained deep ConvNet. Gatys’ method was subsequently extended to style transfer, where the content image is forced to have deep statistics similar to those of the style image. In more recent works, Gatys’ method has been extended to synthesizing textures with non-local structures by using additional constraints such as a correlation matrix and the spectrum. However, such constraints bring in extra hyper-parameters that require manual tuning, and are slow to optimize or cause spectrum-like noise. In contrast, our model can synthesize non-local structures without the aid of these constraints due to its effective sampling strategy. In order to accelerate the synthesis process and synthesize textures larger than the input, Ulyanov et al. and Johnson et al. proposed to combine a fully convolutional generator with Gatys’ model, so that textures can be synthesized in a fast forward pass of the generator. Similar to the model of Ulyanov et al., our model also uses a generator for fast sampling and texture expansion. In contrast to Gatys’ method, which relies on pretrained ConvNets, Xie proposed a generative ConvNet (gCNN) model that can learn the ConvNet and synthesize textures simultaneously. In subsequent works, Xie proposed CoopNet by combining gCNN and a latent variable model. This model was later extended to video and 3D shape synthesis. Our model can be regarded as a combination of Gatys’ method and gCNN, as it utilizes the idea of deep statistics in Gatys’ method and the probabilistic framework of gCNN.
compared these methods quantitatively by studying the synthesizability of the input exemplars. Recent works leveraged deep learning techniques for synthesizing dynamic textures. For instance, Tesfaldet et al. proposed to combine Gatys’ method with an optical flow network in order to capture the temporal statistics. In contrast, our model does not require the aid of other networks, as it can flexibly use spatial-temporal ConvNets for both spatial and temporal modelling.
As for sound texture synthesis, classic models  are generally based on wavelet framework and use handcrafted filters to extract temporal statistics. Recently, Antognini et al.  extended Gatys’ method to sound texture synthesis by applying a random network to the spectrograms of sound textures. In contrast, our model learns the network adaptively instead of fixing it to random weights, and our model is applied to raw waveforms directly.
The texture inpainting problem is a special case of image or video inpainting problem, where the inpainted image or video is assumed to be a texture. Igehy  transferred Heeger and Bergen’s texture synthesis algorithm  to an inpainting algorithm. Our inpainting algorithm shares important ideas with Igehy’s method , as we also adopt an inpainting by synthesizing scheme. Other important texture inpainting methods include conditional Gaussian simulation  and PatchMatch based methods [32, 33].
This section recalls several baseline models on which cgCNN is built. The theoretical connections between these models and cgCNN will be discussed in Sec. IV.
Given an RGB image texture exemplar $f_0 \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ are the height and width of the image, texture synthesis aims to generate new samples $f$ that are visually similar to $f_0$.
Gatys’ method uses a pretrained deep ConvNet as a feature extractor. For an input texture exemplar, the Gram matrices of feature maps at selected layers are first calculated. New texture samples are then synthesized by matching the Gram matrices of the synthesized textures and the exemplar.
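Formally, Gatys’ method tries to solve the following optimization problem, where $f_0$ is the exemplar and $f$ the synthesized texture:
$$\min_{f} L(f, f_0). \tag{1}$$
The objective function $L$ is defined as:
$$L(f, f_0) = \sum_{l} \left\| \mathcal{G}(D_l(f)) - \mathcal{G}(D_l(f_0)) \right\|_F^2, \tag{2}$$
where $D$ is a pretrained ConvNet, $D_l$ is the feature map at layer $l$, and $\|\cdot\|_F$ is the Frobenius norm. $\mathcal{G}$ is the Gram matrix defined as:
$$\mathcal{G}(F) = \frac{1}{N} F^{\top} F, \tag{3}$$
where $F \in \mathbb{R}^{N \times C}$ is a feature map with $C$ channels and $N$ elements in each channel.
This model is trained by gradient descent using back propagation. Each step follows
$$f^{(t+1)} = f^{(t)} - \alpha \frac{\partial L(f^{(t)}, f_0)}{\partial f^{(t)}}, \tag{4}$$
where $\alpha$ is the learning rate.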
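The Gram-matrix objective described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the feature maps are random stand-ins for ConvNet activations, the names `gram_matrix` and `gram_loss` are hypothetical, and normalizing by the number of positions is an assumption.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a feature map stored as (num_positions, num_channels).

    Dividing by the number of positions is an assumed normalization.
    """
    n = feat.shape[0]
    return feat.T @ feat / n

def gram_loss(feats_syn, feats_ex):
    """Sum over layers of squared Frobenius distances between Gram matrices."""
    return sum(
        float(np.sum((gram_matrix(a) - gram_matrix(b)) ** 2))
        for a, b in zip(feats_syn, feats_ex)
    )

rng = np.random.default_rng(0)
# Random stand-ins for feature maps at two layers (not real ConvNet outputs).
exemplar_feats = [rng.standard_normal((64, 8)), rng.standard_normal((16, 4))]
# Shuffling positions leaves the Gram matrix unchanged up to rounding.
shuffled_feats = [f[rng.permutation(f.shape[0])] for f in exemplar_feats]
other_feats = [rng.standard_normal(f.shape) for f in exemplar_feats]

loss_same = gram_loss(exemplar_feats, exemplar_feats)
loss_shuffled = gram_loss(shuffled_feats, exemplar_feats)
loss_other = gram_loss(other_feats, exemplar_feats)
```

Because the Gram matrix sums over positions, it discards where features occur and keeps only their co-activation statistics, which is why the shuffled feature maps give a loss that is zero up to rounding.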
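TextureNet is a forward version of Gatys’ method. It learns a generator network $G$ with trainable weights $\theta_G$, which maps a sample of random noise $z \sim \mathcal{N}(0, I)$ to a local minimum of Eqn. (2). Writing $L$ for the objective in Eqn. (2) and $f_0$ for the exemplar, this amounts to solving the following optimization problem:
$$\min_{\theta_G} \; \mathbb{E}_{z \sim \mathcal{N}(0, I)} \left[ L(G(z), f_0) \right]. \tag{5}$$
$G$ is trained by gradient descent with approximate gradients:
$$\frac{\partial}{\partial \theta_G} \mathbb{E}_{z} \left[ L(G(z), f_0) \right] \approx \frac{1}{M} \sum_{i=1}^{M} \frac{\partial L(G(z_i), f_0)}{\partial \theta_G}, \tag{6}$$
where $z_1, \dots, z_M$ are samples from $\mathcal{N}(0, I)$.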
gCNN is defined on a more general setting. It aims to estimate the underlying distribution of a set of images and generate new images by sampling from this distribution. In our work, we only consider the specific case where the input set contains only one image , i.e. , and is a stationary texture exemplar.
gCNN defines a distribution of in image space:
where is the normalization factor. is the energy function defined by
where is the output of a ConvNet with learnable weights . gCNN is trained by maximum likelihood estimation (MLE).
CoopNet extends gCNN by combining gCNN with a latent variable model  which takes the form of
where is a forward ConvNet parametrized by and is the synthesized image.
The is trained by MLE, which is to iterate the following four steps:
Generate samples using random .
Feed to gCNN, run steps of Langevin dynamics for : .
Run steps of Langevin dynamics for : .
Update using gradient descent: =
In this section, we first present the definition of our conditional generative ConvNet (cgCNN) model, and then explore two forms of cgCNN, i.e. the canonical cgCNN (c-cgCNN) and the forward cgCNN (f-cgCNN). Finally, we conclude this section with some theoretical explanations of cgCNN.
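Let $f_0$ represent an image, dynamic or sound texture exemplar, and note that the shape of $f_0$ depends on its type. Specifically, $f_0 \in \mathbb{R}^{T \times H \times W \times 3}$ represents a dynamic texture exemplar; $f_0 \in \mathbb{R}^{H \times W \times 3}$ represents an image texture exemplar; and $f_0 \in \mathbb{R}^{T}$ represents a sound texture exemplar, where $H \times W$ and $T$ are the spatial and temporal sizes.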
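Given a texture exemplar $f_0$, cgCNN defines a conditional distribution of the synthesized texture $f$:
$$p(f \mid f_0; \theta_D) = \frac{1}{Z(\theta_D)} \exp\left( -U(f, f_0; \theta_D) \right), \tag{10}$$
where $Z(\theta_D)$ is the normalization factor. $U$ is the energy function, which is supposed to capture the visual difference between $f$ and $f_0$ by assigning lower values to those $f$ that are visually closer to $f_0$. As an analogue to the objective of Gatys’ method in Eqn. (2), we define $U$ by
$$U(f, f_0; \theta_D) = \sum_{l} \left\| \mathcal{G}(D_l(f)) - \mathcal{G}(D_l(f_0)) \right\|_F^2, \tag{11}$$
where $D$ is a deep network with learnable weights $\theta_D$, and $D_l$ is the feature map at the $l$-th layer. $\mathcal{G}$ is a statistic measurement, such as the Gram matrix defined in Eqn. (3). We also test the spatial mean vector as an alternative measurement in our experiment section. For simplicity, in the rest of this paper, we denote $p(f \mid f_0; \theta_D)$ by $p_{\theta_D}$ when the meaning is clear from the text.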
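To make the role of the statistic measurement concrete, the following sketch computes an energy of this form with either a Gram matrix or a spatial mean vector plugged in. The function names, the `(num_positions, num_channels)` layout and the Gram normalization are assumptions made for illustration, and the feature maps are toy arrays rather than outputs of a trained network.

```python
import numpy as np

def gram_statistic(feat):
    """Second-order statistic: channel co-activations averaged over positions."""
    return feat.T @ feat / feat.shape[0]

def mean_statistic(feat):
    """First-order statistic: per-channel mean over all positions."""
    return feat.mean(axis=0)

def statistic_energy(feats_syn, feats_ex, statistic):
    """Sum over layers of squared distances between the chosen statistics."""
    return sum(
        float(np.sum((statistic(a) - statistic(b)) ** 2))
        for a, b in zip(feats_syn, feats_ex)
    )

# Toy single-layer feature maps with 5 positions and 3 channels.
feats_a = [np.zeros((5, 3))]
feats_b = [np.ones((5, 3))]

energy_mean = statistic_energy(feats_a, feats_b, mean_statistic)
energy_gram = statistic_energy(feats_a, feats_b, gram_statistic)
energy_self = statistic_energy(feats_b, feats_b, gram_statistic)
```

Swapping the statistic changes only one argument, which mirrors how the Gram and mean variants of the model differ only in the measurement applied to the feature maps.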
The objective of training cgCNN is to estimate the conditional distribution using only one input exemplar. This is achieved by minimizing the KL divergence between the empirical data distribution, which is a Kronecker delta function centered at the exemplar, and the estimated conditional distribution. The KL divergence can be written as:
where denotes the entropy and .
Note that minimizing is equivalent to MLE, where the log-likelihood is defined as the log-likelihood of the input given itself as the condition:
For the consistency of notation, in the rest of this paper, we use instead of as the objective function.
The gradient of can be written as follows:
Note that the expectation term in Eqn. (14) is analytically intractable, and has to be approximated by the Monte Carlo method. Given a set of samples drawn from the conditional distribution, the gradient can be approximated as:
We can then minimize the objective using gradient descent according to Eqn. (15).
Therefore, the key of training cgCNN is sampling from . We use 1) Langevin dynamics and 2) a generator net for sampling, which lead to c-cgCNN and f-cgCNN respectively.
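c-cgCNN uses Langevin dynamics to sample from the conditional distribution defined by cgCNN. Specifically, starting from a random noise $f^{(0)}$, it uses the following rule to update $f$:
$$f^{(t+1)} = f^{(t)} - \frac{\epsilon^2}{2} \frac{\partial U(f^{(t)}, f_0; \theta_D)}{\partial f^{(t)}} + \epsilon N^{(t)}, \tag{16}$$
where $U$ is the cgCNN energy function, $f_0$ is the exemplar, $f^{(t)}$ is a sample at step $t$, $\epsilon$ is the step size, and $N^{(t)} \sim \mathcal{N}(0, I)$ is a Gaussian noise. A training algorithm for c-cgCNN can be derived by combining Langevin sampling in Eqn. (16) and the approximated gradient in Eqn. (15). Starting from a random noise $f^{(0)}$, the algorithm iteratively goes through a Langevin sampling step and a $D$-learning step:
Langevin sampling: draw samples using Langevin dynamics according to Eqn. (16).
$D$-learning: update network $D$ using the approximated gradient according to Eqn. (15).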
The detailed training process is presented in Alg. 1.
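As a rough illustration of the Langevin sampling step alone, the sketch below runs the standard Langevin update on a toy quadratic energy whose gradient is known in closed form. In the actual model the gradient would come from back-propagation through the ConvNet, and the sampling would alternate with network updates; neither is shown here. The function name `langevin_sample`, the step size and the number of steps are hypothetical choices.

```python
import numpy as np

def langevin_sample(grad_energy, x0, step_size, n_steps, rng):
    """Run the update x <- x - (eps^2 / 2) * grad U(x) + eps * noise."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size ** 2 * grad_energy(x) + step_size * noise
    return x

# Toy energy U(x) = 0.5 * ||x - target||^2, so exp(-U) is a unit Gaussian
# centred at `target`. This stands in for the deep-statistics energy.
target = np.array([2.0, -1.0])

def grad_energy(x):
    return x - target

rng = np.random.default_rng(0)
# 200 independent chains, all started from zero and updated in parallel.
samples = langevin_sample(grad_energy, np.zeros((200, 2)), 0.1, 2000, rng)
sample_mean = samples.mean(axis=0)
sample_std = samples.std(axis=0)
```

With a small step size the chains settle around `target` with roughly unit spread, so the noise term provides diversity among samples while the gradient term pulls them toward low energy.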
The Langevin dynamics used in c-cgCNN is slow, and may be the bottleneck of Alg. 1. As an alternative, we may also use a generator net as a fast approximate sampler of the cgCNN conditional distribution. Specifically, we introduce a generator network with learnable weights, which maps the normal distribution to a parametrized distribution. The training objective is to match this parametrized distribution to the cgCNN conditional distribution, so that samples of the latter can be approximated by samples of the former. In other words, once the generator is trained, approximate samples of the conditional distribution can be drawn by forwarding a noise sample through the generator network, which is much faster than Langevin dynamics in Eqn. (16). Formally, the generator network is trained by minimizing the KL divergence between the two distributions:
The first term in Eqn. (IV-B2) is the entropy of the generator distribution, which is analytically intractable. Following TextureNet, we use the Kozachenko-Leonenko estimator (KLE) to approximate this term. Given a batch of samples drawn from the generator distribution, KLE is defined as:
The second term in Eqn. (IV-B2) is an expectation of our energy function . It can be approximated by taking average over a batch of samples of .
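A nearest-neighbour entropy proxy in the spirit of the Kozachenko-Leonenko estimator can be sketched as follows. This is a simplified form that keeps only the mean log nearest-neighbour distance; the full estimator also carries a dimension factor and additive constants, which do not affect the direction of the gradient. The function name and the small `eps` stabilizer are assumptions.

```python
import numpy as np

def nn_entropy_proxy(samples, eps=1e-12):
    """Mean log distance from each sample to its nearest neighbour in the batch.

    Larger values indicate a more spread-out (higher-entropy) batch.
    """
    x = samples.reshape(samples.shape[0], -1)
    # Pairwise squared Euclidean distances, shape (batch, batch).
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)  # exclude each sample from its own search
    nearest = np.sqrt(sq_dist.min(axis=1))
    return float(np.mean(np.log(nearest + eps)))

rng = np.random.default_rng(0)
batch = rng.standard_normal((64, 5))
proxy_wide = nn_entropy_proxy(batch)
proxy_tight = nn_entropy_proxy(0.1 * batch)  # same batch, shrunk tenfold
```

Shrinking the batch by a factor of ten lowers the proxy by about log 10, so maximizing this term pushes generated samples apart and discourages the generator from collapsing to a single output.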
Now, since both terms in Eqn. (IV-B2) can be approximated, the gradient of can be calculated as:
where are samples drawn from .
The complete training algorithm of f-cgCNN can be derived by training network and jointly. Formally, the goal is to match three distributions: , and by optimizing the following objective function,
To achieve this goal, f-cgCNN is trained by iteratively going through the following three steps:
The detailed algorithm is presented in Alg. 2.
We present some theoretical understandings of cgCNN by relating it to other neural models. We first point out that cgCNN is conceptually related to GAN, as it can be written in a min-max adversarial form. Then we show that: 1) c-cgCNN and f-cgCNN are generalizations of Gatys’ method and TextureNet respectively. 2) c-cgCNN is a variation of gCNN with extra deep statistics, and the forward structures in f-cgCNN and CoopNet are consistent. The main properties of these models are summarized in Tab. I.
The adversarial form of f-cgCNN can be written as:
This adversarial form has an intuitive explanation: network tries to synthesize textures that are more visually similar to the input exemplar, and network tries to detect the differences between them. The training process ends when the adversarial game reaches an equilibrium. Similarly, we have the min-max form of c-cgCNN:
where the synthesized texture plays the role that is played by in f-cgCNN.
It is easy to see that c-cgCNN is a generalization of Gatys’ method with an extra step to learn the network. Indeed, if we fix the network in Eqn. (22) to be a pretrained ConvNet, c-cgCNN reduces to the optimization problem in Eqn. (1), which is exactly Gatys’ method. Furthermore, since f-cgCNN and TextureNet are built on c-cgCNN and Gatys’ method respectively, and they use the same forward structures, we can conclude that f-cgCNN is a generalization of TextureNet as defined in Eqn. (5). In summary, we have the following proposition:
Gatys’ method and TextureNet are special cases of c-cgCNN and f-cgCNN respectively, where the net is fixed to be a pretrained ConvNet.
Compared to Gatys’ method, samples of c-cgCNN are less likely to be trapped in local minima for too long, because the $D$-learning step always seeks to increase the energy of the current samples. For example, if a sample $f$ is a local minimum at step $t$, the subsequent $D$-learning step will increase the energy of $f$, so the energy of $f$ may be higher than that of its neighborhood at the beginning of step $t+1$, and the Langevin steps will then sample a texture different from $f$. In our experiments, we find this property enables us to synthesize highly structured textures without extra penalty terms.
Unlike TextureNet and Gatys’ method, both c-cgCNN and f-cgCNN can synthesize other types of textures besides image textures, because they do not rely on pretrained ConvNets. In addition, thanks to their forward structures, both f-cgCNN and TextureNet can synthesize textures that are larger than the input.
In general, c-cgCNN can be regarded as a variation of gCNN for texture synthesis. It should be noted that the energy defined in gCNN does not involve any deep statistics, hence it can be used to synthesize both texture and non-texture images, such as human faces. However, the energy defined in cgCNN incorporates deep statistics (Gram matrix or mean vector) specifically designed for texture modelling, hence it is more powerful in texture synthesis but cannot handle non-texture images.
CoopNet uses a latent variable model as the forward structure to accelerate the Langevin dynamics in gCNN. Note that the forward structures in CoopNet and f-cgCNN are consistent, as they both seek to learn the distribution defined by their respective backward structures, i.e. gCNN and cgCNN. Furthermore, they are equivalent in a special setting, as stated in the following proposition.
If we 1) disable all noise terms in Langevin dynamics, 2) set in CoopNet, and 3) discard the entropy term in f-cgCNN, the forward structures in CoopNet and f-cgCNN become equivalent.
In this setting, denote the output of the latent variable model in gCNN as , then the target is defined as step Langevin dynamics starting from , i.e. . Training amounts to minimize the objective function via gradient descent. Note the gradient of the objective function can be calculated as , which is exactly back-propagation for minimizing
according to the chain rule. Because the generator net in f-cgCNN is also trained using back-propagation, it is clear that the forward structures in CoopNet and f-cgCNN are equivalent.
All of cgCNN, CoopNet and gCNN can synthesize various types of textures. However, unlike f-cgCNN, whose synthesis step is a simple forward pass, the synthesis step of CoopNet involves several Langevin steps of gCNN; it is therefore difficult to expand textures using CoopNet.
|Model||Forward structure||Backward structure||Multi-scale statistics||Dynamic texture synthesis||Sound texture synthesis||Texture expansion||Fast sampling|
|Gatys’ ||none||pretrained ConvNet||✓||✗||✗||✗||✗|
|TextureNet ||generator||pretrained ConvNet||✓||✗||✗||✓||✓|
|CoopNet ||latent variable model||gCNN||✗||✓||✗||✓|
In our model, we use the same training algorithms described in Alg. 1 and Alg. 2 and statistics (Gram matrix and mean vector) for all types of textures. Therefore, in order to synthesize different types of textures, we only need to modify the network dimensions accordingly, and all other settings remain the same.
Dynamic textures can be regarded as image textures with an extra temporal dimension. Therefore, we simply use 3-dimensional spatial-temporal convolutional layers in cgCNN to capture the spatial appearances and the temporal dynamics simultaneously. In other words, unlike the methods [28, 26] that model spatial and temporal statistics independently, our model treats them equally by regarding a clip of dynamic texture as a spatial-temporal volume, in which both the spatial and the temporal dimensions are stationary.
Sound textures can be regarded as a special case of dynamic textures, where spatial dimensions are not considered. However, modelling sound texture is not a simple task, because the sampling frequency of sound textures (10 kHz) is usually far higher than that of dynamic textures (10 Hz). As a result, sound textures show more complicated long-range temporal dependencies and multi-scale structures than dynamic textures.
In our model, we simply use 1-dimensional temporal convolutional layers in cgCNN to extract temporal statistics. We use atrous  convolutions to ensure large receptive fields, which enable us to learn long-range dependencies. Unlike Antognini’s model  which applies fixed random ConvNets to the spectrograms, our model learns the ConvNet using raw waveforms.
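To see why atrous (dilated) convolutions help with long-range dependencies, the receptive field of a stack of stride-1 1-D convolutions can be computed directly. The layer configurations below are hypothetical and are not the architecture used in the paper.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in samples) of stacked stride-1 1-D convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        field += (kernel - 1) * dilation
    return field

# Four hypothetical layers with kernel size 3.
plain = receptive_field([3, 3, 3, 3], [1, 1, 1, 1])    # no dilation
atrous = receptive_field([3, 3, 3, 3], [1, 2, 4, 8])   # doubling dilation
```

With the same number of layers and weights, doubling the dilation at each layer grows the receptive field exponentially with depth instead of linearly, which is useful when the signal is sampled at a high rate.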
As a proof of concept, we present a simple algorithm for texture inpainting based on our texture synthesis algorithm described in Alg. 1.
Given an input texture with a corrupted region , the texture inpainting problem is to fill so that the inpainted texture appears as natural as possible. In other words, must be visually close to at least one patch in the uncorrupted region , where is the corrupted region with its border.
Our texture synthesis algorithm described in Alg. 1 can be easily generalized to a texture inpainting algorithm, which iteratively searches for a template in and updates according to the template. Specifically, our method iterates a searching step and a synthesis step. In the searching step, we first measure the energy between and all candidate patches . Then we select the patch with the lowest energy to be the template. In the synthesis step, we update according to template using Alg. 1. It is obvious that this algorithm ensures the inpainted region is visually similar to at least one patch (e.g. the template) in the uncorrupted region.
In the searching step, we use grid search to find the template . Note the template can also be assigned by the user . It is possible to replace the grid search by more advanced searching techniques such as PatchMatch , and use gradient penalty  or partial mask  to ensure a smooth transition near the border of . However, these contradict the purpose of this algorithm, which is to show the effectiveness of the proposed c-cgCNN method by combining it with other simplest possible methods.
The detailed inpainting algorithm is presented in Alg. 1.
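The searching step can be sketched as a plain grid search over patches that do not touch the corrupted region. This is only an illustration under simplifying assumptions: the `energy` argument is a user-supplied callable, the demo uses a squared difference of patch means as a stand-in for the deep-statistics energy, and the names `find_template` and `target_box` are hypothetical.

```python
import numpy as np

def find_template(image, mask, target_box, stride, energy):
    """Grid-search the uncorrupted area for the patch with the lowest energy.

    mask is True on corrupted pixels; target_box = (top, left, height, width)
    is the corrupted region together with its border.
    """
    top, left, height, width = target_box
    target = image[top:top + height, left:left + width]
    best_energy, best_pos = np.inf, None
    for i in range(0, image.shape[0] - height + 1, stride):
        for j in range(0, image.shape[1] - width + 1, stride):
            if mask[i:i + height, j:j + width].any():
                continue  # candidate overlaps the corrupted region
            value = energy(target, image[i:i + height, j:j + width])
            if value < best_energy:
                best_energy, best_pos = value, (i, j)
    return best_pos, best_energy

# Toy image: dark left half, bright right half, with a hole in the bright half.
image = np.zeros((32, 32))
image[:, 16:] = 1.0
mask = np.zeros((32, 32), dtype=bool)
mask[4:8, 20:24] = True
image[mask] = 0.5  # placeholder values inside the hole

def mean_energy(a, b):
    return float((a.mean() - b.mean()) ** 2)

pos, value = find_template(image, mask, (2, 18, 8, 8), 4, mean_energy)
```

In this toy case the selected template lies in the bright half, matching the surroundings of the hole; the synthesis step would then update the corrupted region toward that template.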
In this section, we evaluate the proposed cgCNN model and compare it with other texture models. We first perform self evaluations of c-cgCNN in Sec. VI-B-Sec. VI-E. Specifically, we investigate several key aspects of c-cgCNN including the influence of bounded constraints and the diversity of synthesis results. We also carry out two ablation studies concerning the network structure and the training algorithm respectively. Then we evaluate the performance of c-cgCNN and f-cgCNN in texture synthesis and expansion by comparing them with other theoretically related or the state-of-the-art methods in Sec. VI-F-Sec. VI-G. We finally evaluate our texture inpainting method in Sec. VI-H.
The image exemplars are collected from DTD dataset  and Internet, and all examples are resized to . The dynamic texture exemplars are adopted from , where each video has 12 frames and each frame is resized to . We use sound textures that were used in , which are recorded at . For our experiments, we clip the first sample points (about 2 seconds) of each audio as exemplars.
The network used in cgCNN is shown in Fig. 1. It consists of a deep branch and a shallow branch. The deep branch consists of convolutional layers with small kernel size focusing on details in textures, and the shallow branch consists of three convolutional layers with large kernel size focusing on larger-scale and non-local structures. The combination of these two branches enables cgCNN to model both global and local structures. When synthesizing dynamic or sound textures, we use spatial-temporal or temporal convolutional layers respectively. We use the hard sigmoid function as activation function in the network, which is defined as:
textures in each iteration and each sample is initialized as Gaussian noise with variance. We run or Langevin steps in each iteration. The training algorithm stops with a maximum of iterations. We use RMSprop or Adam to update networks and synthesized images, with the initial learning rate set to . In all our experiments, we follow these settings except where explicitly stated.
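For reference, a hard sigmoid is a piecewise-linear, bounded approximation of the sigmoid. The sketch below uses one common parametrization, clip(0.2x + 0.5, 0, 1); the slope and offset are assumptions and may differ from the definition used in the network above.

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation, bounded in [0, 1].

    The slope 0.2 and offset 0.5 are one common convention (assumed here).
    """
    return np.clip(0.2 * np.asarray(x, dtype=float) + 0.5, 0.0, 1.0)

inputs = np.linspace(-10.0, 10.0, 201)
outputs = hard_sigmoid(inputs)
```

Since every activation is bounded, any statistic computed from the feature maps is bounded as well, which is the property that keeps the energy from growing without limit during training.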
All the results are available at http://captain.whu.edu.cn/cgcnn-texture/, where one can check dynamic and sound textures.
We find it is crucial to constrain the magnitude of the energy in order to stabilize the training process, because the energy often grows too large and causes the exploding gradient problem. In this work, we use a bounded activation function in the network architecture to ensure that the energy is upper bounded.
We notice that the choice of activation function has a subtle influence on the synthesis results. This is shown in Fig. 2, where we present the results using different activation functions, i.e. hard sigmoid, tanh and sigmoid respectively. We observe that hard sigmoid produces the most satisfactory results, while tanh often generates some unnatural colors, and the results using sigmoid exhibit some checkerboard artifacts.
It is important for a texture synthesis algorithm to be able to synthesize diverse texture samples from a given exemplar. For the proposed c-cgCNN model, the diversity of the synthesized textures is a direct result of the randomness of the initial Gaussian noise, thus one does not need to make extra effort to ensure such diversity. This is shown in Fig. 3, where a batch of three synthesized samples for each exemplar is presented. Note that all synthesized textures are visually similar to the exemplars, but they are not identical to each other.
In order to verify the importance of the $D$-learning step in Alg. 1, we test a fixed random method in which the $D$-learning step is disabled. This fixed random method effectively optimizes the synthesized image using a fixed random ConvNet.
Fig. 4 presents the comparison between our Alg. 1 and this fixed random method. Clearly, our method produces more favorable results, as our results are sharper and clearer, while the fixed random method can only produce blurry and noisy textures. We can therefore conclude that the $D$-learning step is key to the success of our algorithm, as it enables us to learn better deep filters than a random ConvNet.
In order to investigate the roles played by different layers in our network in Fig. 1, we carry out an ablation study by using different sub-networks. Note the original network has two branches consisting of deep layers and shallow layers respectively. We denote a sub-network with deep layers and shallow layers by . For instance, a sub-network consists of the first layers in the deep branch and the first layer in the shallow branch. We experiment with and . Fig. 5 presents the results of five sub-networks with increasingly large receptive field, i.e. . As we can see, the synthesized textures capture larger scale structures as the receptive field increases.
In general, to generate high fidelity samples, the network must be able to model structures of different scales contained in the input image. As shown in Fig. 5, generates results with serious artifacts because the receptive field is only pixels wide, which is too small for any meaningful texture elements. For the porous texture which consists of small scale elements, a sub-network with a relatively small receptive field, e.g. and , is sufficient to produce high quality textures. However, for textures containing larger-scale structures, like cherries and pebbles, larger receptive fields are often required for producing better results.
For image texture synthesis, we compare the following methods, which are either theoretically related to our model or have reported state-of-the-art performance.
c-cgCNN-Gram: Our c-cgCNN with the Gram matrix as the statistic measurement.
c-cgCNN-Mean: Our c-cgCNN where the mean vector is used as the statistic measurement instead of the Gram matrix.
Gatys’ method : A texture model relying on pretrained ConvNets. It is a special case of our c-cgCNN-Gram model with a pretrained ConvNet.
Self tuning : A recent patch-based EBTS algorithm that utilizes an optimization technique.
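The two statistic measurements named in the list above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names, the normalization by the number of spatial positions, and the squared-error distance are assumptions.

```python
import numpy as np


def gram_statistic(features):
    """Gram matrix of a feature map with shape (channels, height, width)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    # Channel-by-channel inner products, averaged over spatial positions (assumed normalization).
    return flat @ flat.T / (h * w)


def mean_statistic(features):
    """Per-channel spatial mean of a feature map with shape (channels, height, width)."""
    c = features.shape[0]
    return features.reshape(c, -1).mean(axis=1)


def statistic_distance(stat_a, stat_b):
    """Squared Euclidean distance between two statistics of the same shape."""
    return float(np.sum((stat_a - stat_b) ** 2))


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16, 16))  # hypothetical activations of one layer
print(gram_statistic(features).shape)    # (8, 8)
print(mean_statistic(features).shape)    # (8,)
```

Both statistics average over spatial positions, so they describe which features occur (and, for the Gram matrix, co-occur) without recording where they occur.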
shows the qualitative comparison of these algorithms. We observe that Gatys’ method fails to capture global structures (the 3-rd and 4-th textures) because the optimization process converges to a local minimum where global structures are not preserved, and it also generates artifacts such as unnatural colors and noise (zoom in on the 1-st and 5-th textures). Meanwhile, although gCNN and CoopNet can capture most of the large-scale structures, they lose too many details in the results, probably because they do not use any deep statistics. Self tuning excels at generating regular textures (the 3-rd and 4-th textures), but it sometimes loses the global structures (the 1-st, 2-nd and 4-th textures) due to the lack of global structure modeling in this method. In contrast, c-cgCNN-Gram and c-cgCNN-Mean both produce better samples than the baseline methods, since they not only capture large-scale structures but also reproduce small-scale details, even for highly structured textures (the 1-st, 3-rd and 4-th textures). This is because c-cgCNN uses both deep statistics and an effective sampling strategy that is unlikely to be trapped in bad local minima. It is also worth noting that the results of c-cgCNN-Gram and c-cgCNN-Mean are comparable in most cases even though they use different statistics. For quantitative evaluation, we measure the multi-scale structural similarity (MS-SSIM) between the synthesized texture and the exemplar. A higher score indicates higher visual similarity. The quantitative results are summarized in Tab. II. The results show that our methods outperform the baseline methods in most cases.
For dynamic texture synthesis, we use the network , and we sample dynamic textures in sampling step. Fig. 7 presents the qualitative comparison between the c-cgCNN method and the recent two-stream method . We notice that the results of the two-stream model suffer from artifacts such as a greyish color (the 1-st texture) or low-level noise (the 2-nd texture), and sometimes exhibit temporal inconsistency. In contrast, the results of both c-cgCNN-Gram and c-cgCNN-Mean are more favorable, as they are cleaner and show better temporal consistency.
For quantitative evaluation, we measure the average MS-SSIM between each frame of the synthesized results and the corresponding frame in the exemplar. The results are shown in Tab. III, where both c-cgCNN-Mean and c-cgCNN-Gram outperform the two-stream method.
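The frame-wise averaging described above can be sketched as follows. The function `average_framewise_score` and the `toy_similarity` metric are hypothetical names introduced for illustration; `toy_similarity` is only a stand-in so that the sketch runs on its own, and it is not MS-SSIM. In practice an MS-SSIM implementation would be passed as `frame_metric`.

```python
import numpy as np


def average_framewise_score(synthesized, exemplar, frame_metric):
    """Average `frame_metric` over corresponding frames of two videos.

    Both videos are arrays of shape (frames, height, width[, channels]).
    """
    if synthesized.shape != exemplar.shape:
        raise ValueError("videos must have the same shape for a frame-wise comparison")
    scores = [frame_metric(s, e) for s, e in zip(synthesized, exemplar)]
    return float(np.mean(scores))


def toy_similarity(frame_a, frame_b):
    """Stand-in for MS-SSIM (assumption): 1 for identical frames, lower otherwise."""
    return 1.0 / (1.0 + float(np.mean((frame_a - frame_b) ** 2)))


exemplar = np.zeros((12, 32, 32))  # hypothetical 12-frame grayscale clip
print(average_framewise_score(exemplar, exemplar, toy_similarity))
```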
For sound texture synthesis, we use the network where the kernel size and number of filters in each layer are and , and the stride in each layer is except the first layer, where the stride is . We do not use pooling layers in this network.
in waveforms. Unlike the other two methods, which operate in the frequency domain, c-cgCNN uses only raw audio. We observe that the results of these methods are generally comparable, except for some cases where our results are noisier than those of the baseline methods. This is probably due to the loss of short temporal dependencies caused by the large strides in the shallow layers. It also suggests that our results might be further improved by using more carefully designed networks.
The structure of the generator net used in f-cgCNN is borrowed from TextureNet, with two extra residual blocks at the output layer. See  for details. When expanding dynamic or sound textures, the spatial convolutional layers in are replaced by spatial-temporal or temporal convolutional layers accordingly.
Fig. 9 presents a comparison between f-cgCNN and TextureNet in image texture expansion. The results of f-cgCNN and TextureNet are generally comparable, because both of them are able to learn the stationary elements in the exemplars. In addition, f-cgCNN is generally slower to converge than TextureNet in the training phase because it trains an extra net , but the synthesis speed of the two methods is the same, since synthesis in both cases involves a forward pass through the generator net.
Fig. 10 presents the results of dynamic texture expansion using f-cgCNN. The exemplar dynamic texture is expanded to 48 frames and the size of each frame is . We observe that f-cgCNN successfully reproduces stationary elements and expands the exemplar dynamic texture in both the temporal and spatial dimensions, i.e. the synthesized textures have more frames and each frame is larger than in the input exemplars. It should be noted that f-cgCNN is the first neural texture model that enables us to expand dynamic textures.
Fig. 11 presents the results of sound texture expansion. In this experiment, we clip the first 16384 data points (less than 1 second) of each sound texture as the exemplar, and expand the exemplar to 122880 data points (about 5 seconds) using f-cgCNN. Similar to the case of dynamic texture expansion, f-cgCNN successfully expands the exemplar sound texture while preserving the sound elements that occur most frequently. Note that f-cgCNN is also the first texture model that enables us to expand sound textures.
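As a quick check of the lengths quoted above, the short sketch below computes the expansion factor and converts sample counts to durations. The sample counts come from the text; the sampling rate of 22050 Hz is an assumption made only for illustration, since the rate is not stated here.

```python
def duration_seconds(num_samples, sample_rate_hz):
    """Duration of an audio clip given its length in samples and its sampling rate."""
    return num_samples / sample_rate_hz


EXEMPLAR_SAMPLES = 16384        # from the text
EXPANDED_SAMPLES = 122880       # from the text
ASSUMED_SAMPLE_RATE_HZ = 22050  # assumption: the sampling rate is not given in the text

expansion_factor = EXPANDED_SAMPLES / EXEMPLAR_SAMPLES
print(expansion_factor)  # 7.5
print(duration_seconds(EXEMPLAR_SAMPLES, ASSUMED_SAMPLE_RATE_HZ))
print(duration_seconds(EXPANDED_SAMPLES, ASSUMED_SAMPLE_RATE_HZ))
```

The expansion factor of 7.5 is independent of the sampling rate, while the durations in seconds depend on the assumed rate.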
For image texture inpainting, we evaluate our algorithm by comparing it with the following two deep image inpainting methods:
Deep prior : An inpainting algorithm that utilizes the prior of a random ConvNet. This method does not require extra training data.
Deep fill : The state-of-the-art image inpainting algorithm. It requires extra training data, and we use the model pretrained on ImageNet for our experiment.
We use the network where the number of channels is 64. We first prepare a rectangular mask of size near the center of a image, and then obtain the corrupted texture by applying the mask to a raw texture, i.e. all pixels within the masked area are set to zero. The border width is set to 4 pixels. All inpainting methods have access to the mask and the corrupted texture, but do not have access to the raw textures.
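The corruption procedure described above can be sketched as follows. The image and mask sizes are hypothetical placeholders (they are not stated in the text), and reading the "border" as the band of known pixels within the given width of the masked rectangle is an assumption.

```python
import numpy as np


def corrupt_with_center_mask(image, mask_h, mask_w):
    """Zero out a centered rectangle; return (corrupted image, boolean mask)."""
    h, w = image.shape[:2]
    top = (h - mask_h) // 2
    left = (w - mask_w) // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + mask_h, left:left + mask_w] = True
    corrupted = image.copy()
    corrupted[mask] = 0  # all pixels within the masked area are set to zero
    return corrupted, mask


def border_band(mask, width):
    """Known pixels within `width` pixels of the masked rectangle (assumed meaning of 'border')."""
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    band = np.zeros_like(mask)
    band[max(top - width, 0):bottom + width, max(left - width, 0):right + width] = True
    return band & ~mask


raw = np.random.default_rng(0).random((128, 128, 3))  # hypothetical raw texture
corrupted, mask = corrupt_with_center_mask(raw, 40, 60)  # hypothetical mask size
border = border_band(mask, 4)  # border width of 4 pixels, as in the text
print(mask.sum(), border.sum())
```

An inpainting method would receive only `corrupted` and `mask`, while `raw` would be held out for evaluation.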
Fig. 12 presents the qualitative comparison. In general, although the baseline methods can handle textures with non-local structures relatively well (the 1-st texture), they cannot handle random elements in textures (the 2-nd to the 5-th textures). Most results of the baseline methods are blurry, and the results of deep fill sometimes show obvious color artifacts (the 1-st and 5-th textures). Our method clearly outperforms the baseline methods, as it is able to inpaint all corrupted exemplars with convincing textural content, and does not produce blur or color artifacts.
Tab. IV presents the quantitative comparison. We calculate the MS-SSIM score between the inpainted textures and the corresponding raw textures (not shown). A higher score indicates a better inpainting result. It can be seen that our method outperforms other baseline methods in most cases.
For dynamic texture inpainting, we prepare a mask of size , and apply this mask to each frame of the dynamic textures. The border width is set to 2 pixels. We use the network where the number of channels is reduced to 32. The template is assigned by the user because a grid search may incur excessive GPU memory overhead. For sound texture inpainting, the mask covers the interval from the 20000-th to the 30000-th data point. The border width is set to 1000 data points. We use the same network settings as in the sound texture synthesis experiment.
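For the sound case, the masking setup can be sketched in one dimension. The interval and border width are taken from the text; the total signal length, the half-open treatment of the interval, and the reading of the "border" as the known samples adjacent to the mask on each side are assumptions.

```python
import numpy as np


def mask_interval(signal, start, stop, border):
    """Zero out samples in [start, stop); return (corrupted, mask, border band)."""
    n = signal.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[start:stop] = True
    corrupted = np.where(mask, 0.0, signal)
    # Known samples within `border` samples of the masked interval (assumed meaning).
    band = np.zeros(n, dtype=bool)
    band[max(start - border, 0):min(stop + border, n)] = True
    band &= ~mask
    return corrupted, mask, band


waveform = np.random.default_rng(0).normal(size=50000)  # hypothetical signal length
corrupted, mask, band = mask_interval(waveform, 20000, 30000, 1000)
print(mask.sum(), band.sum())
```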
Fig. 13 and Fig. 14 present the results of dynamic texture and sound texture inpainting using our method. Similar to the case of image inpainting, we observe that our method successfully fills the corrupted region with convincing textural content, and the overall inpainted textures are natural and clear. It should be noted that our proposed method is the first neural algorithm for sound texture inpainting.
In this paper, we present cgCNN for exemplar-based texture synthesis. Our model can synthesize high quality image, dynamic and sound textures in a unified manner. The experiments demonstrate the effectiveness of our model in texture synthesis, expansion and inpainting.
Several issues need further investigation. One limitation of cgCNN is that it cannot synthesize dynamic patterns without spatial stationarity, such as the ones studied in . Extending cgCNN to those dynamic patterns would be an interesting direction for future work. Another limitation is that the current cgCNN cannot learn multiple input textures, i.e., it can only learn one texture at a time. Future work should extend cgCNN to the batch training setting, and explore its potential in down-stream tasks such as texture feature extraction and classification .
J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV. Springer, 2016, pp. 694–711.
T. Tieleman and G. Hinton, “Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.